The automatic coordination of low-level parallel operations
is analyzed in the context of microprogramming. The system
structure of an adaptive microprogrammable control unit (adaptive
processor) that dynamically composes horizontal microinstructions
from a sequential stream of microoperations is described. The
adaptive processor organization is designed to react quickly to
microoperation stream changes due to branches by using look-ahead
techniques. A converger mechanism that buffers several
prefetched microprogram block levels is combined with a
conditioning scheme to allow the early resolution of block
predicates and provide a smooth flow of microoperations to the
control unit. The conditions for microoperation execution
initiation for several adaptive processor variations are derived.
These variations include the effects of buffering source and
result operands as well as the effects of operand forwarding.

Analytical performance models for basic processors are
presented. A basic processor is a single-instruction stream,
single-data stream, multiple function unit processor permitting
only primitive operations, operands, and addressing modes. A
processor is described by a control policy that specifies the
flow and execution of microoperations through the processor and
by a machine configuration that specifies the parameters of the
function units. Program characteristics are described by a
program structure. The structure of a model is determined by the
control policy. The model relates the performance
(instruction-execution rate) to the machine and program parameters. These
models have the capability to model horizontal microprogrammable
control units and the effects of source and result buffering and
operand forwarding. Several models are derived for various
microprogrammable and adaptive processors.
CHAPTER 1

INTRODUCTION 

This  thesis  considers  two  aspects  of  central  processing  units 
-  their  hardware  organization  and  their  analytical  modeling.  The 
research  has  been  motivated  by  problems  in  the  field  of 
microprogramming.  Microprogramming  is  a  technique  for 
coordinating  the  low-level  computational  resources  of  a  computer 
to  perform  some  desired  transformation  on  data  or  to  effect  some 
desired  sequence  of  control.  There  are  several  important  areas 
in  which  microprogramming  is  deployed.  Among  these, 
microprogramming  is  used  to 

1.  implement  machine  instruction  sets, 

2.  provide  diagnostics  to  locate  machine  defects, 

3.  provide  an  additional  level  of  security  to  some  segment  of 
important  code, 

4. speed up the execution of some highly utilized routines.

This thesis will address the difficulties associated with the
fourth area, or microprogram optimization.

In the area of microprogram optimization, it is well known
that two major subproblems, processor scheduling and register
allocation, are polynomial complete [DLL73, SET73].
Consequently, one must be content with
heuristics to perform optimizations on a significant fraction of
all microprograms. A second difficulty is that microprogrammable
structures are many and varied. This requires a replication of
effort to produce commonly used algorithms for each different
microprogrammable structure. Transportability is not a property
of microcode. A third difficulty is found with the elegance and
simplicity that microprogramming accords to control units of
computers. The centralization of control functions forces a
serialization of the phases in the execution of microprograms.
Also, microinstruction formats force a rigid structuring of the
computation. This structuring may not be the best for the
particular computation.

This thesis examines these difficulties in the light of
today's technological achievements in semiconductors, notably
large scale integration. The hardware used in computers today is
increasing in sophistication and decreasing in cost. Hardware
that was previously in the domain of supercomputers is quickly
becoming common in minicomputers. Minicomputers can be bought
with cache, pipelined organizations, virtual storage, and array
processing capabilities. This trend shows every sign of
continuing.

We propose to transfer some of the effort of microcode
generation onto an adaptive microprogrammable control unit, or as
it will be called, an adaptive processor. Its organization is
suggested from decompositions of the microoperation
interpretation process. The adaptive processor is a method of
detecting and exploiting parallelism in low-level computations,
and coordinating the computational resources to execute the
computation. Basically, the adaptive processor examines the
instruction stream and vectorizes a set of instructions that can
be executed concurrently. It is similar in concept to a
data-flow processor [RDM77, MIS78]. One significant difference is that
the adaptive processor reacts very quickly to changes in stream
direction. As such, it can cope with instruction streams having
a high incidence of branches. Consequently, the techniques used
in the adaptive processor will be of interest to designers of
pipelined processors.

Several  adaptive  processor  organizations  are  described  at  the 
system  level  using  hardware  organizations  of  varying  degrees  of 
sophistication.  For  these  organizations,  we  extract  control 
policies  that  regulate  the  actions  in  the  interpretation  process. 
These  specify  hardware  and  logical  conditions  that  must  be  met 
for  the  safe  execution  of  a  microoperation. 

The second aspect of this thesis is performance evaluation of
central processing units. This thesis describes an analytical
model that computes the instruction execution rate, or IER, of
single instruction stream multiple function unit processors. The
motivation for such models is to quantify comparisons between
different processor designs executing a given program mix. In
our particular case, a method that allows us to compare the
performance of conventional microprogrammable processors with
adaptive processors is described.

The model described in this thesis is a refinement of the
model described by Bowra and Torng [BT76]. The model parameters
are specified by a machine configuration that provides hardware
information about the operation unit and a control policy that
describes the regulatory actions of the control unit. A program
structure describes program characteristics. The additional
features our model includes are the effects of the control policy
on the supply of microoperations, and the effects of source
buffering, result buffering, and operand forwarding. This will
allow the modeling of a wider class of processors that use
lookahead techniques.

1.1 Outline

Chapter  2  provides  a  background  for  microprogramming 
difficulties  and  describes  previous  research.  Central  processing 
unit  modeling  techniques  are  also  described. 

Chapter  3  describes  the  decomposition  of  microoperation 
interpretation  processes  and  describes  adaptive  processor 
organizations  and  their  control  policies. 

Chapter 4 describes the modeling system and develops models
for some of the processors described in chapter 3.

Chapter 5 summarizes the thesis contributions and gives
suggestions for extensions and future research.
CHAPTER 2

BACKGROUND  AND  DEFINITIONS 

This chapter introduces the basic concepts and definitions
central to this thesis. Microprogrammed computer organizations,
and their advantages and disadvantages, are discussed. Microcode
generation  is  briefly  described,  concentrating  on  previous 
research  in  the  microprogram  optimization  area.  To  overcome  some 
of  the  difficulties  found  with  microprogramming,  the  adaptive 
processor  is  proposed.  It  composes  microinstructions 
dynamically,  providing  certain  advantages  in  exchange  for 
hardware  cost.  Finally,  to  permit  an  evaluation  of  certain 
hardware  organizations,  analytical  central  processing  unit 
performance  modeling  is  suggested  and  introduced. 

2.1 The Central Processing Unit

In  this  section,  we  provide  an  overview  of  the  concept  of  a 
central  processing  unit  and  its  main  constituents,  the  control 
unit  and  the  operation  unit.  We  will  also  provide  some  working 
definitions  that  will  be  used  in  this  thesis. 

The abstraction of a central processing unit of a computing
system is shown in figure 2.1. Its functional properties have
been well documented in the literature and textbooks
[BGV63,WS53,GLU65,HUS70,BN71,EVZ78] for both combinational logic
and microprogrammed control units. Briefly, the function of the
control unit is to fetch and decode each instruction
from memory and to provide a sequence of directives to the
operation unit to control its execution. The operation unit
performs the specified operations on the data. From a given set
of input data and a program in memory, the control unit and
operation unit combine to produce output data. This thesis will
examine the details of the interaction within central processing
units of different structures as they execute a program.
Microprogrammed central processing units and central processing
units that have the ability to execute instructions concurrently
will be of primary interest.

Figure 2.1 Central Processing Unit
2.1.1 The Operation Unit

The main constituents of the operation unit can be classified
into the five resource sets listed below:

F - the set of function units

R - the set of work registers

B - the set of interconnecting links or buses

S - the set of status flags

C - the set of control points

Figure 2.2 illustrates these components for an HP 21MX operation
unit. The work registers are drawn as rectangles, function units as
squares, and buses are the interconnecting lines. Control points
[FLY75] occur at bus intersections to activate an interconnection
and at function units to specify an operation. Specific status
flags are associated with the function units to indicate that a
relevant condition or condition code has arisen during the
execution of an operation completed by the function unit.

Figure 2.2 The HP 21MX Operation Unit

Work registers can be directly connected to function units
through control points. Work registers do not include latches or
buffers that are used for temporary storage or timing holds, nor
the 'general purpose' registers residing in a scratchpad memory
that requires an access cycle. The general purpose registers are
included in the notion of a function unit. The notion of a
function unit has been expanded from the conventional sense of a
data transformer. The set of function units will include any
mechanism that transmits data to or from work registers with
visible latency. Thus main memory, scratchpad memory, and I/O
channels can be considered to be function units because they have
visible transmission times.


The set of function units, F, will be specified by an
F-vector, $F = (x_1, x_2, \ldots, x_N)$, where $x_f$ is the number
of function units of type f and N is the number of different types.
The degree of the operation unit or of the set F, DEG(F), is
defined as

$$\mathrm{DEG}(F) = |F| = \sum_{f=1}^{N} x_f \qquad (2.1)$$

This provides a rough performance indicator of the operation unit
[FLY72], giving the maximum number of function units that can be
concurrently in execution. Chapter 4 of this thesis will
describe more elaborate analytical models that evaluate the
performance of multiple function unit computers.
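
As a small illustration of equation (2.1), the sketch below computes
the degree of an operation unit from its F-vector. The function unit
type names and counts are hypothetical and were chosen only for the
example; they do not describe any particular machine.

```python
# Hypothetical F-vector: number of function units of each type.
# The type names and counts are illustrative assumptions.
f_vector = {"alu": 1, "shifter": 1, "main_memory": 1, "scratchpad": 2}


def degree(f_vector):
    """DEG(F): total number of function units, i.e. the sum of the F-vector."""
    return sum(f_vector.values())


print(degree(f_vector))
```

The degree is only an upper bound on concurrency: it counts the units
that could be busy at once, not the number a given program keeps busy.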
2.1.2 The Control Unit

The control unit has three broad functions: it must decode
each instruction; it must provide correct sequences of orders to
the operation unit to guide the execution; it must determine the
location of the next instruction and initiate its fetch. These
functions are implemented by a decode network that provides the
set of orders and by a system clock or a set of event detectors
that determines when the orders are active. If the decode
network is implemented with sequential logic, the control unit is
said to be hardwired. If it is implemented with structured
logic, such as a memory array, it is said to be microprogrammed.

The diagram of a microprogrammed control unit is shown in
figure 2.3. Control information for the operation unit is stored
in Control Memory. Control Memory may have various functional
properties. These include read-only, read/write, or associative
memories such as programmed logic arrays. Control Memory is
usually an order of magnitude faster than main memory,
making it an expensive resource that is very carefully deployed.

To transmit directives to the operation unit, a control word
is loaded into the Microinstruction Register. The bits of the
control word, which are usually logically grouped into fields and
encoded, are then decoded and phased by the system clock into
control signals. These control signals then activate various
control points in the operation unit. On the basis of the status
flags raised during the course of these activations, a system
state is defined. This state, along with information in the
Microinstruction Register, is translated into a Control Memory
address, and a new word of Control Memory is fetched. In a
microprogrammed central processing unit, this interaction occurs
once each basic or control cycle.

Figure 2.3 Microprogrammed Control Unit

The organization of the control unit is central to this
thesis. In the following sections, there will be a discussion of
some of the structural possibilities for microprogrammable
control units and their quantitative evaluation.

2.2  Microprogramming  Definitions 

As noted before, the control unit provides control pulses to
the control points of the operation unit. In the original
concept of microprogramming [WS53], a microoperation was the
effect of a control pulse generated by ANDing a bit obtained from
Control Memory and a timing pulse from the system clock. For our
needs, this is too fine a subdivision of control and will be
called a microaction. We define a microoperation m to be a
quadruple

m = <OP,S,D,f>

where OP is the operation to be performed on the source values,

S is an ordered set of source specifications,

D is an ordered set of destinations to which the results
are to be transmitted, and

f is the class of function unit that executes OP, having an
ordered set of inputs and ordered set of outputs.

The execution of a microoperation is the sequence of microactions
that

(i) connects the sources of S to the corresponding inputs of
an f-type function unit,

(ii) initiates the operation specified by OP on the inputs,
and

(iii) routes the operation results to their corresponding
destinations specified by D.

We will assume that a microoperation m can be executed by
only one type of function unit. Thus the set of microoperations,
M, defined for the central processing unit can be partitioned
into f-classes, $M_f$:

$$M_f = \{\, m \mid m \in M \text{ and } m \text{ is executed by an } f\text{-type function unit} \,\}$$

We also assume that the execution time of a microoperation is
constant and a multiple of basic cycles. Thus M can also be
partitioned into time classes, $M_t$:

$$M_t = \{\, m \mid m \in M \text{ and the execution time of } m \text{ is } t \,\}$$
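
The quadruple and the two partitions can be made concrete with a short
sketch. The record layout, the operation names, and the execution
times below are assumptions introduced for illustration only; the
thesis does not prescribe a data representation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """A microoperation m = <OP, S, D, f> plus its execution time."""
    op: str         # OP: operation to perform on the source values
    sources: tuple  # S: ordered source specifications
    dests: tuple    # D: ordered destinations
    f: str          # class of function unit that executes OP
    time: int = 1   # execution time in basic cycles (assumed constant)


def partition(microops, key):
    """Group microoperations into disjoint classes by one attribute."""
    classes = defaultdict(list)
    for m in microops:
        classes[getattr(m, key)].append(m)
    return dict(classes)


# Hypothetical microoperation set M.
M = [
    MicroOp("ADD", ("R1", "R2"), ("R3",), "alu", 1),
    MicroOp("SHL", ("R3",), ("R4",), "shifter", 1),
    MicroOp("LOAD", ("R4",), ("R5",), "memory", 3),
    MicroOp("SUB", ("R5", "R1"), ("R6",), "alu", 1),
]

by_unit = partition(M, "f")     # the f-classes
by_time = partition(M, "time")  # the time classes
```

Because each microoperation has exactly one function unit class and
one execution time, every element of M lands in exactly one class of
each partition.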


A microoperation is said to be initiated when a function unit
begins to operate on the specified inputs. A microoperation is
said to be terminated or executed when its results have been
entered into the specified destinations. A microinstruction is a
set of microoperations that is initiated in the same control
cycle. An execution of a microinstruction I is the set of
executions of all microoperations contained in I. Microcode is
defined to be a sequence of microoperations. A block of
microoperations is a maximal contiguous sequence of
microoperations having the following properties:

(i) the data precedence relation of the microoperations
defines a directed acyclic graph

(ii) if any microoperation in the block is executed then all
microoperations in the block are executed.

A block is basically a segment of microcode having a single entry
point  and  a  single  exit  point.  An  execution  of  a  block  is  the 
set  of  executions  of  all  microoperations  contained  in  the  block. 

It will be useful to distinguish stream control
microoperations from others. Stream control microoperations have
the capability to alter the execution sequence of blocks. They
include conditional and unconditional branches and
End-of-Microprogram microoperations that initiate the fetch of the next
microprogram. Their class of function unit will be 1. A
conditional branch microoperation B will be denoted by

B = <OP,S,D,1>

where OP specifies a logical operator,

S specifies a set of condition code values (either T,
F, or don't care),
a condition code source, and
a branch address source,

D specifies a program counter.

The execution of B evaluates a predicate defined by the
relational operation on the specified condition code and the
condition code value of the specified source. If the result is
true, the next block executed is the one whose first
microoperation is located at the branch address. If the result
is false, the block immediately succeeding in memory, or
fall-through block, is executed. We will assume that a branch
microoperation has at most two possible target addresses.
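
A minimal sketch of how such a branch might select its successor
block follows. It assumes that the logical operator is a simple
match of each condition code against a pattern of T, F, and don't
care ('X') values; the thesis allows other operators, and the
function names and addresses here are hypothetical.

```python
def predicate(pattern, codes):
    """True when every condition code matches the pattern.

    pattern: sequence of 'T', 'F', or 'X' (don't care), one per code.
    codes:   sequence of booleans read from the condition code source.
    """
    return all(p == "X" or (p == "T") == c for p, c in zip(pattern, codes))


def next_block(pattern, codes, branch_address, fall_through):
    """Return the address of the successor block of a conditional branch."""
    return branch_address if predicate(pattern, codes) else fall_through


# The first code must be true; the second is ignored.
target = next_block(("T", "X"), (True, False), 0x40, 0x11)
```

The two-way choice between `branch_address` and `fall_through`
reflects the assumption that a branch has at most two possible
target addresses.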

A microprogram is a sequence of one or more blocks. A
microprogram execution is an execution of the blocks contained by
the microprogram, starting with the first block. A change in the
execution sequence of the blocks may be accomplished by the
execution of a branch microoperation in a block.

For a given microprogram sequence, we will use its sequential
execution to define the outcome of the microprogram execution.
It is based on the sequential execution of a block, which is the
sequence of microoperation executions of all the microoperations
in the block such that a microoperation execution is not
initiated until the execution of the immediately preceding
microoperation has terminated. Thus the sequential execution of
a microprogram is an execution in which each block is
sequentially executed and the branch microoperation, in a block
that contains one, occurs at the end of the block.

In this thesis, we will be examining hardware organizations
that reorder microoperations during execution and thus do not
produce sequential executions of the input microprogram. It is
imperative, however, that the resulting execution be equivalent
to the sequential execution of the microprogram. The following
assumptions and definitions will clarify the notion of
equivalence.

We assume that the microprogram and its variable data are
stored in distinct memories, program memory and data memory
respectively, and that the microcode is not modified during
execution. The cells in data memory are defined to be state
variables. A machine state is an assignment of values to the
state variables. The execution of a microoperation changes the
machine state by assigning values to the microoperation
destinations. The set of input variables $\{I_i\}$ for a microprogram
is the set of state variables that is referenced by the
microoperations in the microprogram, each of which must be
assigned a value before its first use in the microprogram
execution. The set of output variables $\{O_i\}$ of a microprogram is
the set of state variables whose final values generated by the
microprogram execution are the outputs of the microprogram. An
input state I is defined by an initial assignment of values to
the input variables. Similarly, an output state O is defined by
the final assignment of values of the microprogram execution to
the output variables.


We say that an execution $E_1$ of a microprogram is equivalent
to an execution $E_2$ if for all possible input states I both $E_1$ and
$E_2$ have the same output state. This definition of equivalence is
based on the view of transforming input states to output states.
In chapter 3 we provide a working definition applicable to
microprogrammable central processing units. This definition will
allow us to determine when a microoperation of a microprogram can
be initiated so that the resulting execution is equivalent to the
sequential execution.
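
The definition can be illustrated by brute force on a toy block. The
sketch below is an assumption-laden illustration, not the working
definition of chapter 3: it models each microoperation as a
(destination, operation, sources) triple, treats unassigned state
variables as 0, and enumerates every input state over a tiny value
domain.

```python
import operator
from itertools import product


def run(order, state):
    """Execute microoperations in the given order on a copy of the state."""
    s = dict(state)
    for dest, fn, srcs in order:
        s[dest] = fn(*(s.get(v, 0) for v in srcs))
    return s


def equivalent(e1, e2, inputs, outputs, domain):
    """True if both executions give the same output state for every input state."""
    for values in product(domain, repeat=len(inputs)):
        state = dict(zip(inputs, values))
        out1, out2 = run(e1, state), run(e2, state)
        if any(out1.get(v) != out2.get(v) for v in outputs):
            return False
    return True


# Hypothetical three-microoperation block.
m1 = ("c", operator.add, ("a", "b"))
m2 = ("d", operator.mul, ("a", "b"))
m3 = ("e", operator.add, ("c", "d"))

sequential = [m1, m2, m3]
reordered = [m2, m1, m3]  # m1 and m2 are independent of each other
broken = [m3, m1, m2]     # m3 runs before its sources are produced

ok = equivalent(sequential, reordered, ("a", "b"), ("c", "d", "e"), range(4))
bad = equivalent(sequential, broken, ("a", "b"), ("c", "d", "e"), range(4))
```

Exhaustive enumeration is feasible only for very small domains, which
is why a definition based on conditions that can be checked during
execution is needed in practice.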


2.3  Generation  of  Microprograms 

In this section, we will examine some of the opportunities
and problems in implementing an algorithm in microcode. We will
be interested in 'optimized' microcode for 'relatively complex'
algorithms. Because the terms 'optimized' and 'relatively
complex' can be misinterpreted, we will make them more precise.

We wish to optimize microcode in the sense of reducing the
execution time of the algorithm. A related optimization, that of
reducing Control Memory width and size [AGE76], is not our central
goal for optimization. Although in many instances the goals of
these differing optimizations will be compatible, we will
concentrate on the speed enhancement objective in optimized
microcode.

By 'relatively complex' algorithms we wish to exclude the
microcode generation problem of a hardware designer implementing
the usual machine instruction set [HDS70] for a computer.
Although such tasks can be complex, the task is usually
subdivided into generating microcode for individual machine
instructions, most of which are relatively straightforward. In
addition, the hardware designer has the additional freedom to
modify the hardware to obtain a speed enhancement or to simplify
microcode. We, however, take the viewpoint of a user presented
with a microprogrammable machine who must create microcode for an
algorithm. The nature, utility, and length of his algorithm
make it worthy of a microprogram.


2.3.1 Microprogram Structure


We have already introduced the basic components of a
microprogram, the microoperation, microinstruction, and block.
We now examine their structure and how they are used in
microprogrammable central processing units.


As defined earlier, a microinstruction is a set of
microoperations, contained in a word of Control Memory, that can
be initiated in the same control cycle. Although a
microinstruction can specify other information, such as immediate
data, our interest is limited to microoperations.
Microinstructions are specified by a microinstruction format set
that defines the permissible combinations of microoperations.
The degree of a microinstruction I, DEG(I), is the number of
microoperations specified by the microinstruction. A
microinstruction format is said to be maximal if it can specify a
microoperation for each function unit in the operation unit,
i.e., if it can specify a microinstruction I such that
DEG(I) = DEG(F), where F is the F-vector of the operation unit
(section 2.1.1). In practice, the maximum degree of a
microinstruction format is usually less than the maximum
operation unit degree. A microinstruction format is said to be
horizontal if it can specify the initiation of more than one
microoperation. It is vertical otherwise. Most
microprogrammable central processing units have a horizontal
format. However, on some, most formats used in a microprogram
will be vertical. The B1700 [SAL76] is such an example.
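
These format properties can be checked mechanically. The sketch below
assumes a format is described simply by the function unit type of
each microoperation slot it provides; this representation, and the
unit names, are hypothetical and chosen only for illustration.

```python
def format_degree(fmt):
    """Maximum number of microoperations a format can specify (one per slot)."""
    return len(fmt)


def is_horizontal(fmt):
    """A format is horizontal if it can initiate more than one microoperation."""
    return format_degree(fmt) > 1


def is_maximal(fmt, f_vector):
    """A format is maximal if it has a slot for every function unit."""
    needed = dict(f_vector)
    for unit_type in fmt:
        needed[unit_type] = needed.get(unit_type, 0) - 1
    return all(count <= 0 for count in needed.values())


# Hypothetical operation unit and three hypothetical formats.
f_vector = {"alu": 1, "shifter": 1, "memory": 1}
wide = ("alu", "shifter", "memory")
pair = ("alu", "memory")
narrow = ("alu",)
```

Here `wide` is maximal and horizontal, `pair` is horizontal but not
maximal, and `narrow` is vertical.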

On a classical microprogrammable control unit the block is a
section of code having a single entry and in which a branch
microoperation, if any, occurs at the end. Other names for a
block that have been used in the literature are subblock [RT74],
straight line code, and DAG [AU72]. The latter is an acronym for
directed acyclic graph. A DAG, D(M,E), is used to represent the
data flow, or the data dependencies, within a block. M is the set
of microoperations within the block and E is the set of directed
edges between them. The incoming edges are ordered and directed
from source nodes, some of which may be outside the block.
Figure 2.4a gives the listing for a block and figure 2.4b is the
corresponding DAG showing the data flow.
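
A sketch of how the edge set E might be derived from a block listing
is given below. It assumes that results carry symbolic names written
only once, so that only producer-to-consumer dependencies arise; the
block shown is hypothetical and is not the listing of figure 2.4a.

```python
def build_dag(block):
    """Return data-flow edges (producer, consumer, operand position).

    block: list of (name, sources, dests) in listing order.
    Sources with no producer inside the block are external inputs.
    """
    last_writer = {}
    edges = []
    for name, sources, dests in block:
        for position, src in enumerate(sources):
            if src in last_writer:
                edges.append((last_writer[src], name, position))
        for d in dests:
            last_writer[d] = name
    return edges


# Hypothetical block of four microoperations.
block = [
    ("m1", ("a", "b"), ("t1",)),
    ("m2", ("a", "c"), ("t2",)),
    ("m3", ("t1", "t2"), ("t3",)),
    ("m4", ("t3", "b"), ("t4",)),
]

edges = build_dag(block)
```

The operand position recorded on each edge keeps the incoming edges
ordered, and sources such as `a`, `b`, and `c` have no edge because
their producers lie outside the block.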

In the example (which does not describe a meaningful
computation), the microoperations are represented in a form that
has not been bound to any particular
resources. Such representations are usually the starting point
for a microprogram generation. The actual code is generated by
combining microoperations that are concurrently executable into
microinstructions and assigning their results to specific work
registers. A predicate is associated with the branch
microoperation of each block to determine the successor block in
an execution. The predicate value will be either TRUE (T) or
FALSE (F). If the block has no branch microoperation or if the
branch is unconditional, a value of T is assigned. In the
example, an unconditional branch S is used to determine the
successor block.

As can be seen from figure 2.4, a hierarchical structure of a
microprogram is being developed. The next higher level indicates
the relations between the component blocks of a microprogram. A
microprogram graph is used to display the control flow in a
microprogram. A microprogram graph, M(B,T), is a directed graph
where B is the set of blocks and T is the set of permissible
transitions directed from a predecessor to a successor block.
The edges will be marked by an execution condition, a T or an F,
to indicate the predicate value in the predecessor node that must
be satisfied to permit the entry into the successor.
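
A microprogram graph can be sketched as a table of labeled
transitions. The block names and the graph below are hypothetical;
True and False stand for the execution conditions T and F.

```python
# Hypothetical microprogram graph M(B, T): each block maps the value of
# its predicate to the successor block reached under that condition.
transitions = {
    "B1": {True: "B2", False: "B3"},
    "B2": {True: "B4"},              # unconditional: predicate is always T
    "B3": {True: "B1", False: "B4"},
    "B4": {},                        # no successors in this sketch
}


def trace(transitions, start, predicate_values):
    """Follow the graph from start, using one predicate value per block."""
    path = [start]
    block = start
    for value in predicate_values:
        successors = transitions[block]
        if value not in successors:
            break
        block = successors[value]
        path.append(block)
    return path


path = trace(transitions, "B1", [False, True, True, True])
```

Each block has at most two outgoing edges, matching the assumption
that a branch microoperation has at most two target addresses.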
Optimizations of microcode are grouped into two categories,
local and global. Local optimizations are performed on the
microcode that is contained within a single block. Global
optimizations, on the other hand, perform optimizations across
blocks on the whole microprogram. The previous research in
microprogram optimizations is described below.


2.3.2  The  Basis  of  Microcode  Optimizations 


Although the extent of optimiza
directly affected by the algorithm
computation, this will not be an issue
a compiler has performed certain glob
RUD75] on an algorithm and that a
sequence of some intermediate level lan
this point, the compiler may proce
instruction object code or to undertake
phase. Upon deciding to generate micr
areas in which performance can be e
execution of the machine instruction
microcode. The first optimization is m
microprogrammable central processing
remaining three are machine dependent.

1. In conventional microprogram
units, machine instructions
interpretation loop consisting
instruction fetch, decode, o
instructions into microcode, all but one instruction fetch
phase and decode phase can be removed.

2. Some operation units will have additional registers
available to microcode. By judiciously assigning operands
to these registers, time savings result by replacing main
memory accesses with fas
eliminating memory address

3. Some microprogrammable
multiway branch in a s

sequencing blocks according to their expected entrance
frequencies. Abd-Alla and Karlgaard [AK74] make use of all the
above optimizations. They suggest using execution traces to
isolate segments of high frequency code and data for microcode
implementation.


Much of the literature is devoted to techniques of optimizing
horizontal microcode. These are based on methods that produce a
minimum time schedule for a set of processors executing a set of
tasks constrained to a given precedence relation. Yau et al.
[YST74] give a depth first search technique to find the minimum
number of microinstructions to cover a given directed acyclic
graph of microoperations. This algorithm is optimal if all
microoperations require the same execution time. They also
suggest a heuristic technique to resolve resource conflicts
between microoperations. The heuristic orders microoperations
according to the number of successor microoperations.
Ramamoorthy and Tsuchiya [RT74] describe a high-level language
SIMPL and its compiler that performs suboptimal optimization on
horizontal microcode. The program is decomposed into blocks,
generating earliest time and latest time partitions within each
block. These partitions give, respectively, the earliest and
latest starting times for each microoperation such that the set
of microoperations will be executed in the minimum number of
steps if all microoperations are initiated between these times.
Microoperations on the critical path are first assigned to
microinstructions. If there is a resource conflict, an arbitrary
critical path microoperation is assigned and a new
microinstruction is created for assigning a remaining critical
path microoperation. This is continued until all critical path
microoperations are assigned. Non-critical path microoperations
are then assigned to these microinstructions, with new
microinstructions being created to resolve conflicts. Tsuchiya
and Gonzales [TG76] suggest a similar technique. They resolve
conflicts by giving priority to the microoperation with the most
successor microoperations. Dasgupta and Tartar [DT76, DAS78] give
an algorithm that produces a near-minimum partition of the
microoperations in a block. Microoperations can have arbitrary
execution times. However, on finite resource microprogrammable
central processing units this partition may not be directly
applicable. This is true if the microoperations in a class of
the partition cannot all be accommodated in one microinstruction.
Kraska [KRA69], however, does provide a backtracking algorithm
for finding a minimum schedule for a set of special purpose
function units operating on a set of tasks having arbitrary
execution times and an execution order that is constrained by the
precedence relations of a directed acyclic graph. This algorithm
can solve the problem of finding a minimal partition of the
microoperations in a block. The resulting set of
microinstructions will have a minimized execution time. DeWitt
[DEW75] observes that parallelism is highly machine dependent and
that ultimately, microinstruction formats determine the possible
parallelism for a given microprogrammable central processing
unit.
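The flavour of these compaction techniques can be illustrated with a
small sketch. It is not the algorithm of any of the cited papers. It
assumes unit execution times, one function unit per microoperation,
and a simple cycle-by-cycle greedy packing in which critical path
microoperations (zero slack between earliest and latest start) are
placed first and remaining conflicts are resolved in favour of the
microoperation with the most successors. The names earliest_latest,
compose, preds and unit_of are hypothetical.

```python
from collections import defaultdict


def earliest_latest(ops, preds):
    """Earliest and latest start steps, assuming unit-time microoperations.

    ops   -- microoperation names, listed in a topological order (assumed)
    preds -- dict mapping a microoperation to the set it depends on
    """
    succs = defaultdict(set)
    for op in ops:
        for p in preds.get(op, ()):
            succs[p].add(op)
    earliest = {}
    for op in ops:
        earliest[op] = max((earliest[p] + 1 for p in preds.get(op, ())), default=0)
    length = max(earliest.values()) + 1  # steps needed with unlimited resources
    latest = {}
    for op in reversed(ops):
        latest[op] = min((latest[s] - 1 for s in succs[op]), default=length - 1)
    return earliest, latest, succs


def compose(ops, preds, unit_of):
    """Greedily pack microoperations into microinstructions (a heuristic)."""
    earliest, latest, succs = earliest_latest(ops, preds)
    placed = {}              # microoperation -> index of its microinstruction
    words = []               # each word is one composed microinstruction
    remaining = list(ops)
    while remaining:
        # Ready: every predecessor sits in an earlier microinstruction.
        ready = [o for o in remaining
                 if all(p in placed for p in preds.get(o, ()))]
        # Zero-slack (critical path) first, then most successors.
        ready.sort(key=lambda o: (latest[o] - earliest[o], -len(succs[o])))
        word, busy = [], set()
        for o in ready:
            if unit_of[o] not in busy:   # resource conflict check
                busy.add(unit_of[o])
                word.append(o)
        for o in word:
            placed[o] = len(words)
            remaining.remove(o)
        words.append(word)
    return words


# Hypothetical block: c needs a, d needs a and b, e needs c.
EXAMPLE_OPS = ["a", "b", "c", "d", "e"]
EXAMPLE_PREDS = {"c": {"a"}, "d": {"a", "b"}, "e": {"c"}}
EXAMPLE_UNITS = {"a": "alu", "b": "alu", "c": "shifter", "d": "alu", "e": "mem"}

if __name__ == "__main__":
    print(compose(EXAMPLE_OPS, EXAMPLE_PREDS, EXAMPLE_UNITS))
```

In the example, a and b compete for the same function unit in the
first step; a is on the critical path and is placed first, and b is
deferred to the second microinstruction without lengthening the
schedule.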


The  assignment  of  variables  to  registers  has  received  less 
attention.  Abd-Alla  and  Kaarlgard  [AK74]  examine  segments  of 
machine  instructions  for  replacement  by  a  microprogram  and  use 
frequency counts to assign variables to registers within the
segment.  Their  technique  uses  a  register  prologue  to  save  all 
registers  and  to  preload  the  most  frequently  used  variables. 
Upon  conclusion,  an  epilogue  stores  the  variables  and  restores 
the  registers  to  their  previous  states.  Despite  the  possible 
redundant  movement  between  registers  and  memories,  they  obtain 
excellent  results.  Techniques  used  at  the  intermediate  language 
level [FEE74, BEA74, KEN72] could be applied to improve register
utilization. Blain et al. [BPH74] give an overview of a general
microprogram  development  system  that  pays  particular  attention  to 
the  location  of  variables  during  the  microcode  generation  phase. 
No  results,  however,  are  given. 
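A frequency-count assignment with a register prologue and epilogue
can be sketched as follows. This is only an illustration of the
general idea described above, not the procedure of [AK74]. The
function names, the register names, and the textual save/load/store/
restore operations are all hypothetical.

```python
from collections import Counter


def assign_registers(trace, registers):
    """Give the most frequently referenced variables a register each.

    trace     -- sequence of variable names referenced by the segment
    registers -- names of the registers available to microcode
    """
    counts = Counter(trace)
    chosen = [var for var, _ in counts.most_common(len(registers))]
    return dict(zip(chosen, registers))


def prologue_epilogue(assignment, registers):
    """Build the save/preload prologue and the store/restore epilogue."""
    prologue = [f"save {reg}" for reg in registers]
    prologue += [f"load {reg}, {var}" for var, reg in assignment.items()]
    epilogue = [f"store {var}, {reg}" for var, reg in assignment.items()]
    epilogue += [f"restore {reg}" for reg in registers]
    return prologue, epilogue


# Hypothetical reference trace for one segment and two free registers.
EXAMPLE_TRACE = ["x", "y", "x", "z", "x", "y", "w"]
EXAMPLE_REGISTERS = ["R0", "R1"]

if __name__ == "__main__":
    mapping = assign_registers(EXAMPLE_TRACE, EXAMPLE_REGISTERS)
    pro, epi = prologue_epilogue(mapping, EXAMPLE_REGISTERS)
    print(mapping)
    print(pro)
    print(epi)
```

The prologue and epilogue are the source of the redundant movement
between registers and memory mentioned above: every register is saved
and restored whether or not its old contents are still needed.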

Baba [BAB77] describes a microprogram generating system, MPG,
that  can  generate  microcode  from  a  microprogram  written  in  a 
machine-independent  high  level  microprogramming  language.  MPG 
permits  the  generation  of  a  microprogram  compiler  for  any 
computer  described  by  a  machine  description  language.  The 
compiler  performs  register  allocations,  microinstruction 
compositions,  and  places  the  microinstructions  in  the  limited 
control memory. For a set of four microprograms on a HITAC 8350,
the number of composed microinstructions was within 25% of the
number of microinstructions generated by hand assembly. The
average processing time on a HITAC 8350 required to generate a
microinstruction  ranged  from  4.6  to  9.9  seconds.  Some  97%  of 
this  time  was  spent  in  either  the  optimization  phase  or  in  the 
control  memory  allocation  phase.  This  indicates  that  general 
solutions  to  generating  near  optimal  microcode  are  available. 
However,  the  processing  time  is  excessive. 
2.4 Microprogramming Problems
te the many advantages of microprogramming [HUS70, AR76]
xamples of dramatic performance gains [WEB67, AK74], the
user  community  has  shown  reluctance  to  exploit 
.  Some  of  the  reasons  for  this  apparent  anomaly  will  be 

bly  the  most  difficult  hurdle  for  a  potential 
rammer  to  surmount  is  the  lack  of  high  level  software 
nd  educational  aids.  Usually  only  an  assembler,  circuit 
and  the  microcode  listing  to  the  manufacturer's  machine 
on  set  are  given  [R^*D75].  Consequently,  the 
rammer  must  immerse  himself  into  the  copious  detail  of 
ral processing unit at the microprogram level. He must
ch'up  a  learning  curve  to  master  a  detailed  and  complex 
interrelationships  before  he  can  begin  to  generate 
microcode. 


Although  microprogramming  has  removed  the  ad  hoc  nature  of 
control units, it has not prevented the design of unreasonable
machines  from  a  programming  viewpoint.  This  situation  has 
prompted  Rosin  [ROS73]  to  suggest  that  microprogramming  should  be 
defined  as  "the  implementation  of  hopefully  reasonable  systems 
through  interpretation  on  unreasonable  machines".  He  recommends 
that  "a  portion  of  the  vast  amount  of  time  spent  on  coping  with 
unreasonable  machines  would  be  better  spent  on  other  tasks". 
Lehman  [LEH75]  cogently  argues  that  because  microcode  is  orders 
of  magnitude  more  parallel,  less  structured  and  less  intelligible 
than  assembly  language  programs,  a  wholesale  conversion  of 
programs to microcode would be dangerous. He states that there
has been insufficient research to determine safely manageable
microcode module size, interfacing conventions and microcode
certification techniques. Until then, large segments of
microcode will prove unreliable and unmaintainable. Bauer's
experience in generating microcode for the IBM 360/67
corroborates Rosin's and Lehman's views [BAU75].
tree height [KUC77]. These are some of the tradeoffs that must
be examined to ensure optimality. Because the general register
assignment and processor scheduling problems are both NP-complete
[SET73, ULL73], a joint optimization may require excessive
computation time. Thus for pragmatic reasons compromises are
inevitable.

An empirical characteristic of microcode is that block length
is relatively short [BBL73, KW74, PS77]. A further inescapable
fact is that static optimizations are effectively limited by
block interfaces because transitions between blocks are in
general statically indeterminate. Under certain conditions,
however, microoperations may be moved through block interfaces.
Dasgupta [DAS77] describes an algorithm that detects parallel
microoperations in microprograms with IF-THEN-ELSE constructs.
Conditions for moving microoperations into and around blocks are
determined. Short block lengths tend to minimize the
computational requirements of an optimization if only local
optimizations are performed. Although compilation techniques
exist for maximizing parallelism and minimizing execution time of
conditional statements and loops [KUC77, LAM74], their excessive
function unit, register, and Control Memory requirements would
rule out their use on most microprogrammable central processing
units. Such techniques would be used on vector or
single-instruction stream, multiple-data stream (SIMD) machines
such as the CRAY-1 [RL77, EUS78].
2.5 Hardware Restrictions of Microprogrammable Computers

As mentioned in the previous section, computers at the
microcode level have been called unreasonable machines. Part of
this is due to the ease of design of a microprogrammable control
unit, allowing the creation of unique and highly specialized
computer structures. To take advantage of the computational
potential, unique microcode must be generated using
individualized optimization strategies that are not sharable with
other computer structures. For a sampling of the range of
possibilities and a discussion of microprogrammable structures,
the following references are recommended: [HUS70, AR76, SAL76].
A prime example of the possible diversity of microprogrammed
structures is the IBM 360 family of machines [HUS70]. The
members that are microprogrammed have little commonality in their
microcode although the machine instruction sets may be identical.

Despite the advertised advantages of flexibility,
adaptability, and extensibility [HUS70], classical
microprogrammable structures have three characteristics that
impose a rigid microprogram execution mode and limit hardware
extensibility. These are

1.  the  centralization  of  major  control  functions  within  a 
single  Control  Memory  system, 

2.  the  synchronization  of  operation  unit  resources  to  the 
Control  Memory  cycle,  and 

3.  the  rigid  and  restrictive  binding  of  microoperations  to 
microinstruction  formats. 

The  most  serious  constraint  to  performance  is  the  integration 
of  major  control  functions  into  the  Control  Memory  system.  This 
forces a serialization of machine instruction execution and
prevents the overlap between any of the interpretation loop
phases of successive machine instructions. For example, the
overlap of the instruction fetch and execution phase, common to
many hardwired central processing units, is not possible unless
the  fetch  phase  microoperations  are  included  in  all  execution 
phase  microcode  segments.  McClure  [MCC71]  describes  methods  of 
distributing  microprogrammed  control  over  the  different  phases  of 
the  machine  instruction  interpretation  loop,  allowing  their 
overlap.  In  this  technique  of  distribution,  the  resources  of  the 
operation  unit  are  specialized  to  particular  phases.  Other 
methods  allowing  concurrency  at  the  machine  instruction  level  use 
identical  operation  units  for  the  different  phases,  multiplexing 
a  common  control  memory  [KV73,  EV73,  LEV73]. 


The restrictive binding imposed on microoperations by
microinstruction formats may limit the performance potential in
two ways, called degree restriction and static binding. Degree
restriction pertains to microprogrammable central processing
units whose microinstruction format is not maximal. In practice,
submaximal format sets are the rule for two reasons. First,
because buses are expensive, most operation units have limited
busing capacity that cannot exploit maximal parallelism.
Secondly, Control Memory requirements are also reduced because
the number of unused microoperation fields is reduced.
Consequently, computations that present sets of initiatable
microoperations whose resource requirements are satisfied by the
operation unit but cannot be initiated due to format restrictions
will require additional microinstructions and time penalties will
result. Such restrictions are reported by Bauer [BAU75] and
DeWitt [DEW75]. The severity of degree restrictions will be
program dependent, depending on the incidence of format
restricted microoperation sets in microprograms.


Static binding is a result of the microinstruction
composition occurring during microcode generation time. There
are two distinct cases. The first has been briefly described in
the previous section and is a result of the static indeterminacy
of subblock transitions. This limits the composition of a
microinstruction to microoperations within a block. The second
case can occur when the operation unit resources are not under
the sole disposition of Control Memory. For example, a direct
memory access controller (DMA) can request memory cycles
independently of the control unit. If a microinstruction is to
initiate a memory microoperation while a DMA request is being
serviced, the microinstruction is forced to wait until memory
becomes available. This may also impose a needless wait on any
other microoperation contained in the microinstruction.
Similarly, if the central processing unit path to main memory is
cache buffered, delays result if an address is not found resident
in cache.


The  synchronization  of  the  operation  unit  with  Control  Memory 
directly  transfers  any  access  delays  to  the  computation  time.  In 
the  case  where  Control  Memory  accesses  are  overlapped  with 
operation  unit  execution,  a  branch  out  of  sequence  may  delay  the 
access of the next control word until the target address is
determined.
Hardware extensibility is discouraged by the effort required
to integrate the controls of a new device into control store and
by the possible necessity to remicroprogram Control Memory
contents. Adding a new device could be as easy as sharing
fields, but it may require the addition of a new
microinstruction format or extending Control Memory width, along
with the control decoding network. Remicroprogramming will be
inevitable. Similarly, replacing a function unit by an
operationally equivalent but faster unit may require
remicroprogramming to benefit from any speedup.

This  section  has  given  a  brief  description  of  some  of  the 
difficulties  with  a  classical  microprogrammed  control  structure. 
The  next  section  will  describe  some  hardware  enhancements  to  the 
control  structure  to  reduce  some  of  the  restrictive  effects. 

2.6  The  Adaptive  Processor 

This section will present an overview of hardware that
relaxes many of the restrictions and constraints inherent to
classical microprogrammable structures. Basically, the hardware
uses lookahead on a sequential stream of microoperations and
dynamically composes a microinstruction at execution time. This
is accomplished by a Stream Controller that detects executable
microoperations and by distributing the microinstruction register
control fields to the function units in the operation unit.
Figure 2.5 displays the diagram of an adaptive microprogrammable
control unit. The adaptive microprogrammable processor will also
be called adaptive processor, or adaptive microprogrammable
control unit.

To describe the operation of the Stream Controller the
concept of an instruction stream [FLY72] will be used. A
microoperation stream is a sequence of microoperations that is
presented to the central processing unit for execution. In this
case, the stream emanates from Control Memory, microoperations
being fetched in control words of contiguous microoperations.
The microoperations are not bound to any microinstruction format
but are passed in sequence to buffers in the Converger, and then
to the Issuer. If the buffers are empty, the microoperation
block is passed directly to the Issuer. From the Issuer,
microoperations are dispatched to appropriate fields within the
Control Buffer when they have satisfied the issue conditions.
These structures will be discussed in the next chapter along with
initiation conditions that dynamically determine when a
microoperation can begin execution. In this manner, the adaptive
microprogrammable control unit effectively removes the static
binding restriction of microinstruction formats.
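The idea of composing microinstructions at execution time can be
made concrete with a toy simulation. It is only a sketch under
strong assumptions and does not reproduce the issue conditions of
chapter 3: every microoperation takes one cycle, the Issuer scans a
fixed window of the stream each cycle, and a microoperation is issued
only if its function unit is free and it has no register dependency
on an earlier microoperation still in the window. Branches and the
Converger are not modelled. MicroOp, compose_dynamically and the
window parameter are hypothetical names.

```python
from typing import NamedTuple


class MicroOp(NamedTuple):
    name: str
    unit: str      # function unit the microoperation needs
    srcs: tuple    # source registers
    dst: str       # destination register


def compose_dynamically(stream, window=4):
    """Return the microinstructions composed cycle by cycle."""
    pending = list(stream)
    words = []
    while pending:
        issued, busy_units = [], set()
        reads, writes = set(), set()   # registers used by earlier window ops
        for i, op in enumerate(pending[:window]):
            # Conservative hazard test against every earlier op in the window.
            hazard = (bool(writes & set(op.srcs))   # reads a pending result
                      or op.dst in writes           # overwrites a pending result
                      or op.dst in reads)           # overwrites a pending source
            if not hazard and op.unit not in busy_units:
                issued.append(i)
                busy_units.add(op.unit)
            reads.update(op.srcs)
            writes.add(op.dst)
        words.append([pending[i].name for i in issued])
        pending = [op for i, op in enumerate(pending) if i not in issued]
    return words


# Hypothetical four-microoperation stream.
EXAMPLE_STREAM = [
    MicroOp("add", "alu", ("r1", "r2"), "r3"),
    MicroOp("shl", "shifter", ("r4",), "r5"),
    MicroOp("sub", "alu", ("r3", "r5"), "r6"),
    MicroOp("load", "mem", ("r7",), "r8"),
]

if __name__ == "__main__":
    print(compose_dynamically(EXAMPLE_STREAM))
```

In the example the load is issued ahead of the sub that precedes it
in the stream, because the sub must wait for the results of add and
shl. No microinstruction format fixed at microcode generation time
determines this grouping.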

The  Stream  Controller  is  part  of  a  pipeline  through  which  the 
stream  must  pass.  As  such,  its  performance  is  sensitive  to  any 
disruptions  to  the  stream  that  may  be  caused  by  branches.  The 
Converger  reduces  the  disruption  by  examining  upstream 
microoperations.  If  an  unconditional  branch  is  detected, 
assuming  that  relative  addressing  is  used,  the  Converger  will  be 
able  to  provide  the  target  address  one  cycle  later.  This 
suspends  the  stream  from  Control  Memory  for  one  cycle.  The 
Converger, however, buffers the stream to the Issuer, reducing
the  disruption.  Conditional  branches  pose  a  more  serious  problem 
because  their  resolution  is  usually  dependent  upon  the  result  of 
some  downstream  microoperation.  Although  the  target  address  can 
also  be  provided  in  the  following  cycle,  the  issuing  of 
microoperations  upstream  from  the  branch  cannot  begin  until  the 
branch  predicate  has  been  evaluated.  The  severity  of  the 
disruption  will  depend  on  the  distance  between  the  branch  and  its 
predicate  dependency.  Some  smoothing  is  possible  by  prefetching 
the  heads  of  the  alternate  segments,  providing  initiatable 
microoperations  in  the  cycle  following  predicate  evaluation. 
These  prefetched  segments  are  buffered  in  the  Converger.  Thus  by 
using  lookahead  for  stream  control  microoperations,  the 
composition  of  microinstructions  across  block  boundaries  becomes 
possible. 

The  prefetching  of  microoperations  also  uncouples  their 
execution  from  Control  Memory  timing.  This  also  allows  the  fetch 
and  decoding  of  machine  instructions  (or  microprograms)  to  be 
overlapped with the execution of preceding machine instructions.
This facet is the dynamic counterpart of one of the major
contributions to the execution speedup of a microcoded algorithm
over its equivalent machine instruction version (sec. 2.3.2). A
microcoded algorithm requires a single microprogram call.
Afterwards, no machine instructions need be fetched from main
memory. On the adaptive processor, however, the necessary
instruction fetch to main memory may interfere with data
fetching. This interference is reduced as main memory speed is
increased.
From this brief discussion, it is apparent that the
performance of an adaptive microprogrammable control unit will
depend on several parameters. These are the Control Memory
bandwidth, Converger capacity, and the Issuer scan rate, the
maximum rate at which microoperations can be issued. A
discussion of these parameters will be deferred until chapter 3.


2.7 Central Processing Unit Performance
depends on the extent of detail of the model. Analytical models,
because  of  their  minimal  computing  requirements,  are  prime 
candidates  for  interactive  modeling.  Chapter  4  of  this  thesis 
will  be  devoted  to  analytical  performance  models  of  central 
processing  units. 


Detailed  analytical  central  processing  unit  performance 
models  have  only  recently  begun  to  receive  interest.  One  reason 
for  the  previous  lack  of  interest  was  the  simple  structure  of  the 
early  generations  of  central  processing  units.  Because  of  their 
basic,  sequential  flow  of  control,  the  instruction  execution  rate 
was easily calculated. As sophistication of central processing
unit structures increased, with higher degrees of overlapping of
control functions, instruction execution rate calculations
remained relatively easy because the execution phases of
consecutive instructions were still initiated sequentially and
there was no overlap between them. It was not until multiple
function unit central processing units such as the CDC6600 or IBM
360/91 were conceived that the complexity of the modeling process
sharply increased. However, central processing unit modeling
received  little  motivational  impetus  because  for  the  overwhelming 
majority  of  programs,  overall  system  performance  bottlenecks  were 
the  I/O  and  main  memory  bandwidths.  The  instruction  execution 
rate  was  not  a  sensitive  parameter  and  accurate  instruction 
execution  rates  were  not  necessary. 


Today's  technologies  have  improved  all  performance  aspects  of 
computer  systems.  The  rapidly  evolving  semiconductor  industry 
has  made  the  central  processing  unit  an  ever  decreasing  fraction 
of the computer system cost, making highly parallel structures
within  central  processing  units  economical.  Coupled  with  these 
technological  improvements,  there  has  been  a  trend  to  increasing 
the  sophistication  of  problems  solved  with  the  aid  of  computers. 
This  trend  has  increased  the  number  of  situations  capable  of 
effectively  using  the  enhanced  central  processing  unit  power, 
increasing  the  importance  of  the  instruction  execution  rate 
parameter.  Furthermore,  there  remain  technological  and 
economical  limitations  on  combining  circuit  elements.  Thus  the 
detailed  modeling  of  central  processing  units  is  necessary  to 
quantitatively  evaluate  architectural  components  of  the  central 
processing unit. This is particularly true for microprocessors
whose primary design constraints are chip area and pin count.
Also,  in  many  cases,  microprocessors  have  become  the  system 
bottlenecks  [PAR76]. 

The research on analytical performance models for multiple
function unit central processing units or microprogrammable
[...] Torng [BT76] [...] between the
issuance of successive instructions is determined. This model
characterizes the central processing unit by a machine
[...] of function [...] when an [...] execution. [...] Structure.
[...] stream by [...] instruction
frequency occurrence distribution, and the operand and predicate
dependency distributions. This model will be described in
chapter 4 along with modifications that improve its accuracy.
Bowra [BOW73] also gives a Markovian model to determine
instruction execution rate. It is more complicated and has more
restrictions on its applicability than the issue delay model.
Tartar and Dasgupta [TD74] give a performance model for a
microprogrammable central processing unit. They basically
analyze the machine instructions and their constituent
microinstructions. For a given distribution of machine
instructions, they can determine the number of microinstructions
executed. Their technique can be used to compare the emulation
capabilities [ROS73, SAL76] of microprogrammable hosts after the
microcode has been generated. Also, Misunas [MIS76] gives
performance bounds for a data-flow processor [EUM77], and Cheung
and Mowle [CM76] give simulation results for a set of function
units executing independent instruction streams.

Figure 2.6 Von Neumann Machine


In  chapter  4,  analytical 
processor classes will be derived
organizations  in  the  sense  th 
structure,  figure  2.6,  and  tha 
issue policy [BT76] and governs how and when instructions are
transmitted to the operation unit for execution. It is described
in detail in chapter 3. The machine configuration, in addition
to specifying the function unit types, their number, and
execution times, also specifies parameters for the control unit.
These include window capacity, Issuer scan rate, Control Memory
bandwidth, and the number and type of virtual function units
[TOM67, KEL75]. A processor class is basic iff

1.  microoperations  are  executed  by  the  operation  unit,  requiring 
only  one  function  unit, 

2.  each  function  unit  has  an  independent  output  bus,  and 

3.  all  function  units  can  simultaneously  access  the 
work  register  set. 

Condition  1  prevents  function  units  from  communicating  directly 
with  memory.  The  issue  delay  model  [BT76]  has  this  same 
restriction  and  precludes  machine  instructions  that  specify  main 
memory  operands.  The  other  conditions  simplify  the  analysis. 

The  input  to  the  basic  processor  model,  the  program  structure,  is 
similar  to  that  used  by  Bowra  and  Torng  [BT76].  Its  description 
is  deferred  to  chapter  4. 
CHAPTER 3


The  adaptive  processor  is  a  technique  of  dynamically 
coordinating  computer  resources  to  detect  and  exploit  parallelism 
within  computations  having  a  relatively  high  incidence  of 
branches.  This  chapter  presents  system  descriptions  of  several 
adaptive  processors.  Although  the  adaptive  processor  is  more 
generally  applicable,  its  description  is  given  in  the  context  of 
microprogramming.


The  adaptive  processor  designs  presented  here  are  heavily 
influenced  by  possible  decompositions  of  the  machine  instruction 
interpretation  process  or  the  similar  microoperation 
interpretation  process,  into  sets  of  coordinated  tasks.  The 
first  section  shows  how  some  commercially  available  processors 
have had these processes subdivided into constituent tasks.
The next section describes Conditioning schemes. These
schemes permit, in certain situations, the outcome of a branch to
be determined earlier in time than would be possible on a
comparable conventional processor. The adaptive processor uses
look-ahead techniques [KEL75] in conjunction with a Conditioning
scheme. Several block levels of microoperations are
conditionally prefetched and stored in a Converger mechanism to
provide a high availability of microoperations to the Issuer and
Control Buffer. This is discussed in the third section that
describes the main components of an adaptive processor.
Afterward, execution hazards and the conditions for safe
execution of microoperations are described. These conditions are
combined  with  information  about  the  machine  operating  state  to 
formulate  issue  and  execution  policies  for  a  series  of  Issue 
Control  Buffer  pairs  that  might  be  used  in  adaptive  processors. 

Following  this  series  of  descriptions,  a  multiply  algorithm 
is  used  to  illustrate  some  of  the  actions  within  an  adaptive 
processor  as  microinstructions  are  composed  dynamically.  The 
chapter closes with a discussion of some considerations in
generating  microcode  for  adaptive  processors,  a  summary  of 
advantages  and  disadvantages  of  the  organization,  and  some 
suggestions  for  extensions  and  further  research. 

Control Policies

In  this  section  we  describe  the  concept  of  a  job  and  how  it 
is  adapted  to  the  interpretation  of  machine  instructions  and 
microoperations.  The  description  will  use  Chen's  application  of 
assembly  line  concepts  to  overlap  processors  [CHE75]  and  the 
notion  of  a  task  as  described  by  Coffman  and  Denning  [CD73].  We 
also  describe  the  Control  Policy  which  is  an  extension  of  the 
Issue  Policy  used  by  Bowra  and  Torng  [BT76].  The  Control  Policy 
specifies  the  conditions  that  regulate  the  execution  of  an 
instruction  on  multiple  function  unit  central  processing  units. 
The constituent of a job, called a task, is an atomic unit of
activity that is specified in terms of its external behaviour:
its function, inputs, outputs and completion time. A job is a
task system $(T, \lessdot)$ where $T$ is a set of tasks $\{T_i\}$ that is
coordinated by a relation $\lessdot$ called operational precedence.
$T_i \lessdot T_j$ means that task $T_i$ must be completed before $T_j$
begins. A job, $(T, \lessdot)$, can be represented by a directed graph
called a [...]. The node set of the graph is $T$ and a directed
edge $(T_i, T_j)$ is defined whenever $T_i \lessdot T_j$ and there is no
intermediate node $T_k$ such that $T_i \lessdot T_k \lessdot T_j$. There are two
events associated with each task. There is task initiation,
occurring when a task begins its activity and task termination
occurring when a task completes.
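The edge rule above keeps only the immediate precedences: a pair of
tasks is joined by an edge when one must precede the other and no
third task lies between them. A minimal sketch of that construction
follows. It assumes the precedence relation is supplied as a set of
ordered pairs that may or may not already be transitively closed. The
function names and the three-task example are hypothetical.

```python
from itertools import product


def transitive_closure(tasks, pairs):
    """Close a precedence relation under transitivity (Warshall's method)."""
    before = set(pairs)
    # product() varies its first component slowest, so k is the outer loop.
    for k, i, j in product(tasks, repeat=3):
        if (i, k) in before and (k, j) in before:
            before.add((i, j))
    return before


def graph_edges(tasks, pairs):
    """Edges (Ti, Tj) with Ti before Tj and no intermediate task Tk."""
    before = transitive_closure(tasks, pairs)
    return {(i, j) for (i, j) in before
            if not any((i, k) in before and (k, j) in before for k in tasks)}


# Hypothetical job: three phases of interpreting one machine instruction.
EXAMPLE_TASKS = ["fetch", "decode", "execute"]
EXAMPLE_PAIRS = {("fetch", "decode"), ("decode", "execute"), ("fetch", "execute")}

if __name__ == "__main__":
    print(sorted(graph_edges(EXAMPLE_TASKS, EXAMPLE_PAIRS)))
```

In the example the pair (fetch, execute) is part of the relation but
produces no edge, because decode is an intermediate task.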

Given a specific job to be performed, a set of resources must
be provided for its accomplishment. It may be possible to
decompose the job into a set T of coordinated tasks, each of
which may be executed using some subset of the available
resources. Such job decompositions can significantly increase
the job completion rate by exposing tasks from the same job that
can be concurrently executed, or tasks from different jobs that
are in various stages of completion. This concept has been
successfully used to design high throughput production assembly
lines and pipelined processors.
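
The size of the potential gain can be illustrated with a small calculation. The sketch below makes idealizing assumptions that the text does not: every job splits into the same number of tasks, all tasks take equal time, each task has its own resource, and successive jobs never wait on one another. The function name and parameters are hypothetical.

```python
def completion_time(n_jobs, n_tasks, task_time, overlapped):
    """Total time to finish n_jobs under the idealized assumptions above."""
    if overlapped:
        # The first job needs n_tasks steps; each later job finishes
        # one step after its predecessor.
        return (n_tasks + n_jobs - 1) * task_time
    # Without overlap each job runs all its tasks before the next starts.
    return n_jobs * n_tasks * task_time


SERIAL = completion_time(100, 5, 1, overlapped=False)
PIPELINED = completion_time(100, 5, 1, overlapped=True)
```

With 100 jobs of 5 unit-time tasks, the serial schedule needs 500 time units and the overlapped schedule 104, so the completion rate approaches one job per task time.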

A Control Policy for a job consists of a set of subpolicies,
{P_i}, one for each constituent task. Each subpolicy P_i
specifies the conditions for the initiation and completion of
task T_i. The conditions must ensure that the task will be
executed correctly and that there will be resources for its
execution.

For  central  processing  units,  there  are  two  classes  of 
similar  jobs  that  we  will  examine.  These  are  the  interpretation 
of  a  machine  instruction  and  the  interpretation  of  a 
microoperation.  The  major  phases  of  these  jobs  have  been 
mentioned  in  chapter  2  and  include  fetch,  decode,  operand  fetch, 
execution,  and  result  transmission.  The  segmentation  of  these 
phases  into  a  task  set  varies  for  different  computers  depending 
on  the  design  philosophy  and  performance  objectives  used  in  the 
central  processing  unit  implementation.  Examples  of  job 
decompositions  and  control  policies  used  on  different  computers 
are  given  below.  These  decompositions  illustrate  only  major 
features  of  the  machines. 

DEC  PDP-8 

In  a  PDP-8  there  is  no  effective  concurrency  used  in  the 
interpretation  of  a  machine  instruction.  Consequently  there  is 
only  one  task  in  this  job.  The  control  policy  has  only  one 
subpolicy  which  may  be  stated  as: 

EXECUTE   INIT:  When the previous machine instruction has terminated.

          TERM:  When the generated results have been transmitted
                 to their destinations in the memory or in the
                 register set.

The  termination  condition  has  been  generally  stated  and  is 
intended  to  convey  the  effect  of  an  instruction.  For  example,  a 
HALT instruction is considered to transmit a '0' to the ENABLE
flip-flop  that  controls  the  running  of  the  system. 

IBM 360/50 Control Unit

At  the  microprogram  level  in  the  control  unit  of  the  360/50, 
the  job  of  interpreting  a  microinstruction  has  two  concurrent 
tasks  -  microinstruction  fetch  and  control  transmission.  The  two 
subpolicies  are 

FETCH     INIT:  When the previous microinstruction has been placed
                 in the Microinstruction Register (MIR).

          TERM:  When the microinstruction has been placed in the MIR.

TRANSMIT  INIT:  When the microinstruction has been placed in the MIR,
                 according to the system clock pulses.

          TERM:  When transmission of the last control pulse
                 is completed.

In  this  case  all  the  actions  are  controlled  by  a  system  clock, 
giving  rise  to  the  well  defined  transitions  between  the  FETCH  and 
TRANSMIT  tasks. 

CDC 6600

The CDC 6600 [THO64] is a more complex example of the
segmentation of the machine instruction interpretation job.
Instruction  words  of  sixty  bits  are  fetched  from  main  memory  and 
placed  into  the  BUFFER  register,  the  bottom  of  the  eight  word 
INSTRUCTION  STACK.  This  action  pushes  up  the  old  instruction 
words.  In  the  case  of  a  backward  branch  to  an  instruction 
contained  in  the  STACK,  fetching  from  main  memory  is  suspended  as 
instructions  are  retrieved  from  the  STACK. 

Instructions from the BUFFER and the bottom STACK entry are
passed  to  a  generalized  queue  and  reservation  scheme  called  the 
SCOREBOARD.  The  SCOREBOARD  maintains  the  instantaneous  status  of 
all  registers  and  function  units,  and  can  decide  if  an 
instruction  can  be  issued  to  a  function  unit,  and  when  it  can  be 
initiated.  The  tasks  of  the  machine  instruction  job  are  FETCH  an 
instruction word from memory to BUFFER, STACK the BUFFER entry,
ISSUE  an  instruction,  and  EXECUTE  an  instruction.  These  tasks 
define  segments  in  a  pipeline  that  interprets  the  machine 
instruction.  Each  can  be  operating  concurrently  on  different 
machine  instructions  from  the  instruction  stream. 

IBM 360/91 Floating Point Unit

The actions used in the job of executing floating point
instructions on the 360/91 [TOM67, RL77] are similar to those
used by the CDC 6600. One exception is that on some instructions
the operands may reside in main memory, requiring an address
calculation and a memory fetch. Another significant difference
is that the result operand transmission is separated from the
EXECUTE task.


Instructions  are  fetched  and  placed  in  an  instruction  BUFFER. 
When  a  conditional  branch  is  detected,  conditional  preprocessing 
of  the  main  instruction  stream  continues  while  sixteen  bytes  from 
the  branch  target  address  are  loaded  into  the  BRANCH  BUFFER.  If 
a  backward  branch  to  an  instruction  within  64  bytes  is  detected, 
a loop mode is entered and instructions are executed out of the
BUFFER.  When  instructions  are  passed  from  the  BUFFER  and  into 
the  Floating  Point  Operation  STACK,  any  instruction  requesting  a 
main  memory  operand  is  converted  into  a  register-register  type 
instruction.  The  conversion  is  accomplished  by  creating  a  buffer 
load  instruction  that  places  the  operand  value  into  a  specified 
buffer  register  and  then  specifying  this  buffer  register  in  the 
corresponding  operand  field  of  the  register-register  instruction. 
From  the  STACK,  the  instructions  are  issued  to  a  VIRTUAL  FUNCTION 
UNIT  that  buffers  the  instruction  while  it  waits  for  any  of  its 
source  operands  or  function  unit.  If  a  source  operand  is 
available  in  the  register  at  this  time,  its  value  is  transmitted 
to  the  virtual  function  unit.  Issuing  occurs  only  from  the  first 
STACK  position  and  halts  only  if  there  is  no  virtual  function 
unit. Execution begins when all sources and the requested
function  unit  are  available.  After  execution  terminates,  the 
result  is  buffered  in  the  function  unit  result  register  until  the 
COMMON  DATA  BUS  becomes  available.  The  result  is  then 
transmitted  via  the  COMMON  DATA  BUS  to  all  destinations  that 
require  the  result  value.  These  can  be  any  one  of  the  floating 
point  registers,  any  of  the  virtual  function  unit  source  buffers 
or  any  buffer  in  the  memory  write  queue. 


The following tasks are used to describe the job of
interpreting a floating point machine instruction on the IBM
360/91: FETCH the machine instruction, DECODE the machine
instruction, initiate the OPERAND ACCESS, STACK, ISSUE, and
EXECUTE the machine instruction and FORWARD and DEPOSIT the
result value in the destination. These are approximations to the
actual tasks (see [RL77] for more detail) but they illustrate the
job decomposition possibilities. Figure 3.1 illustrates the
precedence structure of the job.

Figure 3.1 IBM 360/91 Floating Point Execution Job Structure

The  diagram  shows  that  the  OPERAND  ACCESS  task  can  occur 
simultaneously with STACK and ISSUE tasks. The DEPOSIT task
occurs  twice.  The  last  occurrence  indicates  its  precedence  in 
interpreting  the  current  machine  instruction.  The  first 
occurrence,  preceding  EXECUTE,  is  the  DEPOSIT  of  a  value 
generated  by  a  preceding  machine  instruction.  The  subpolicies 
regulating  these  tasks  are: 

FETCH           INIT:  when (1) memory not busy,
                       and  (2) BUFFER not full,
                       and  (3) a branch is detected (alternate Branch
                                BUFFER is used).

                TERM:  when (1) BUFFER full
                       or   (2) Loop mode is entered.

DECODE          INIT:  when (1) BUFFER not empty
                       and  (2) STACK not full.

                TERM:  when (1) BUFFER empty
                       or   (2) STACK full.

OPERAND ACCESS  INIT:  when (1) BUFFER not empty
                       and  (2) memory operand requested by a decoded
                                instruction.

                TERM:  when memory access has been initiated.

STACK           INIT:  when (1) BUFFER not empty
                       and  (2) STACK not full.

                TERM:  when (1) BUFFER empty
                       or   (2) STACK is full.

ISSUE           INIT:  when (1) STACK not empty,
                       and  (2) the execution condition is on,
                       and  (3) a virtual function unit is available,
                       and  (4) there is no blocked instruction in the
                                STACK.

                TERM:  when (1) STACK empty,
                       or   (2) no execution condition on,
                       or   (3) required virtual function unit is not
                                available,
                       or   (4) an instruction is blocked in the STACK.

EXECUTE         INIT:  when (1) both instruction sources are at the
                                virtual function unit
                       and  (2) the requested function unit is available.

                TERM:  when the result is in the function unit output
                       buffer.

FORWARD         INIT:  when (1) result is in function unit output buffer
                       and  (2) the Common Data Bus is available.

                TERM:  when the value has been transmitted along the CDB.

DEPOSIT         INIT:  when (1) value is on the Common Data Bus
                       and  (2) the value is requested by the register,
                                memory buffer, or virtual function unit
                                source field.

                TERM:  when value has been stored in the destination.

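
A subpolicy of this kind is a pair of boolean conditions over the machine state, which the following Python sketch makes concrete for the ISSUE task. The state record and its field names are hypothetical simplifications introduced for illustration; they are not taken from the 360/91 documentation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class MachineState:
    """Hypothetical snapshot of the state examined by the ISSUE subpolicy."""
    stack_empty: bool
    execution_condition_on: bool
    virtual_unit_available: bool
    instruction_blocked: bool


def issue_init(s):
    """ISSUE INIT: all four initiation conditions must hold."""
    return (not s.stack_empty
            and s.execution_condition_on
            and s.virtual_unit_available
            and not s.instruction_blocked)


def issue_term(s):
    """ISSUE TERM: any one of the four termination conditions suffices."""
    return (s.stack_empty
            or not s.execution_condition_on
            or not s.virtual_unit_available
            or s.instruction_blocked)


# Enumerate every combination of the four state bits.
ALL_STATES = [MachineState(*bits) for bits in product([False, True], repeat=4)]
```

Enumerating the sixteen states shows that, as written, the ISSUE termination condition is the logical complement of the initiation condition: issuing continues exactly as long as it could be started.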


3.1.1 The Microoperation Interpretation Process


The  previous  section  has  described  the  subdivision  of  jobs 
into  tasks.  Such  subdivisions  have  been  used  to  design  computer 
systems having high computation rates [CHE75]. We will examine a
decomposition  of  the  interpretation  of  microoperations. 
Considering  that  conventional  microprogrammable  central 
processing  units  use  only  an  elementary  segmentation  of  the 
activities  that  interpret  a  microoperation,  the  possibility  of  a 
finer  decomposition  suggests  that  the  adaptive  processor  could 
benefit  from  its  exploitation.  The  extent  to  which  this  is 
possible  will  be  examined.  In  this  and  the  following  sections  we 
examine  microoperation  interpretation  and  consider  some  factors 
in  selecting  hardware  structures  and  control  policies  to 
implement  an  adaptive  processor. 


The job of interpreting microoperations is analogous to
machine instruction interpretation as described in section 3.1.
It is clear that a decomposition will depend on the hardware of
the control unit. In this section we assume a traditional
microprogrammable control unit with certain hardware
enhancements. In composing a microinstruction, the
microprogrammer examines a string of microoperations, selecting
those that can be placed in the microinstruction. A fundamental
restriction to the microprogrammer is that he can select
microoperations only from a single block. We will examine
hardware possibilities that reduce this and other restrictions
imposed by a classical microprogrammable architecture. The
decomposition described in this section will be partly based on
the activities of the microprogrammer as he composes a
microinstruction and partly on the additional functions of an
adaptive processor. This microoperation interpretation job can
be divided into three distinct phases: provision, disposition,
and execution. These are described below.

Provision Phase

The  primary  function  of  the  provision  phase  is  to  ensure  an 
adequate  supply  of  microoperations.  One  of  the  design 
objectives,  the  ability  to  compose  microinstructions  across  block 
boundaries,  makes  exceptional  demands  on  the  provision  phase.  On 
a  conventional  microprogrammable  central  processing  unit,  the 
microprogrammer  selects  microoperations  for  a  microinstruction 
only  from  a  single  block  of  microoperations.  The  adaptive 
processor, in addition, requires access to segments of both
successor blocks. The microoperation streams needed by the
adaptive processor for microinstruction composition are the
ACTIVE stream, consisting of microoperations from the current
block, and two conditional streams called TRUE and FALSE obtained
from the current block successors. Access to these three streams
will  permit  the  adaptive  processor  to  examine  microoperations 
across  block  boundaries.  Under  certain  conditions,  this  will 
allow  the  composition  of  a  microinstruction  containing 
microoperations  from  a  block  and  its  successor. 

The following tasks will be used to implement the provision
phase:

FETCH: Obtain the needed microoperations. At least the ACTIVE
block and the initial segments of the TRUE and FALSE blocks
will be needed. In a later section we will see that
prefetching the successors to at least the TRUE and FALSE
blocks will be needed to maintain an adequate supply for
certain microprograms.

DECODE: Detect branches and identify block boundaries. These
are the essential activities of this task. It will also be
useful at this point to extract information about the
microoperation that will be used later. This could include
execution time, and whether or not a condition code is to be
generated. Also, the microoperation could be reformatted to
some standard format that is convenient to use by succeeding
tasks.

CONDITION: Assign an execution condition to each block. The
execution condition specifies which predicate values must
occur for the microoperation to be executed.

BUFFER: Store microoperation block segments in buffers.
Several buffers are necessary so that blocks can be
simultaneously accessed. Buffers exist for both
microoperations just entering for interpretation and for
microoperations that have been delayed.


Disposition Phase

In the microoperation disposition phase, microoperations are
examined for placement in the microinstruction that is being
currently composed. In the adaptive processor, conditional
microoperations are also passed into this phase. Because the
current microinstruction is being composed dynamically, three
conditional microinstructions must be formed simultaneously to
cover the possibilities. The ACTIVE microinstruction contains
only ACTIVE microoperations and is initiated in the following
cycle only if the predicate has not been evaluated. The other
two microinstructions contain either ACTIVE and TRUE or ACTIVE
and FALSE microoperations. These are initiated in the following
cycle only if the ACTIVE predicate has been evaluated and its
value corresponds to the conditional microinstruction type.

For each microoperation in the disposition phase, activities
must be initiated to determine the microoperation requirements,
to determine if the system can provide them, and then decide the
appropriate disposition. The requirements can be classified into
three categories:

(a) The microoperation must be kept in its correct
    stream: ACTIVE, TRUE or FALSE.

(b) The microoperation must be able to access the values
    in the specified sources and be able to deposit
    its results in the specified destinations.

(c) Resource: The microoperation must have the required
    function unit to begin execution and the specified
    destination to complete execution.

The first two requirements relate to correct microprogram
execution and will be further discussed in section 3.4 on
execution hazards. The third requirement reflects the need of
physical resources to execute the microoperation. If all of
these needs are satisfied, the resources are assigned to the
microoperation and the microoperation is passed to the execution
phase. If any of these requirements is not satisfied, the
specified destination is marked invalid to succeeding
microoperations and the issue of the microoperation is delayed to
at least the following cycle. An invalid work register cannot be
used for source access or result deposits. The following tasks
are used to subdivide the disposition phase:

DIRECT: Direct the microoperation to the appropriate microinstruction
according to its execution condition:

- ACTIVE microoperations to all three microinstructions.

- TRUE microoperations to TRUE microinstruction only.

- FALSE microoperations to FALSE microinstruction only.

SOURCE: Verify that the specified sources contain valid data.

DESTINATION: Verify that the specified destination will be
available.

FUNCTION: Verify that a required function unit is available.

ISSUE: Send the microoperation to the selected microinstruction
register if the SOURCE, DESTINATION and FUNCTION tests have
all been successful and assign the requested resources.

DELAY: Reexamine the microoperation in the next
cycle if any of the SOURCE, DESTINATION and FUNCTION
tasks have been unsuccessful.
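
One cycle of the disposition phase can be sketched as follows. This is a simplified illustration, not the thesis's mechanism: the dictionary layout of a microoperation, the register and unit names, and the function names are all hypothetical. Two simplifying assumptions are made: a destination becomes unavailable to succeeding microoperations once it has been assigned or marked invalid, and a function unit is claimed for all three microinstructions even though TRUE and FALSE microoperations could never execute together.

```python
def direct(condition):
    """DIRECT: name the conditional microinstructions that receive a microoperation."""
    if condition == "ACTIVE":
        return ["ACTIVE", "TRUE", "FALSE"]
    if condition in ("TRUE", "FALSE"):
        return [condition]
    raise ValueError(f"unknown execution condition: {condition}")


def dispose(microops, valid_registers, free_units):
    """Run SOURCE, DESTINATION, FUNCTION, then ISSUE or DELAY, in stream order."""
    microinstructions = {"ACTIVE": [], "TRUE": [], "FALSE": []}
    delayed = []
    valid = set(valid_registers)
    free = set(free_units)
    for op in microops:
        source_ok = all(reg in valid for reg in op["sources"])   # SOURCE
        destination_ok = op["dest"] in valid                     # DESTINATION
        function_ok = op["unit"] in free                         # FUNCTION
        if source_ok and destination_ok and function_ok:
            free.discard(op["unit"])                             # ISSUE: assign unit
            for name in direct(op["cond"]):
                microinstructions[name].append(op["name"])
        else:
            delayed.append(op["name"])                           # DELAY to next cycle
        # Assumption: an assigned or invalidated destination cannot be
        # used by succeeding microoperations in this cycle.
        valid.discard(op["dest"])
    return microinstructions, delayed


# Hypothetical stream: b reads r2, which a has not yet deposited.
OPS = [
    {"name": "a", "cond": "ACTIVE", "sources": ["r1"], "dest": "r2", "unit": "alu"},
    {"name": "b", "cond": "TRUE", "sources": ["r2"], "dest": "r3", "unit": "shift"},
    {"name": "c", "cond": "FALSE", "sources": ["r1"], "dest": "r4", "unit": "shift"},
]
COMPOSED, DELAYED = dispose(OPS, {"r1", "r2", "r3", "r4"}, {"alu", "shift"})
```

In the example, microoperation a enters all three microinstructions, b is delayed because its source is pending, and c enters only the FALSE microinstruction.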


Execution Phase

Once  a  microoperation  is  issued,  it  enters  the  execution 
phase  and  many  of  the  activities  occur  in  the  operation  unit. 

The  activities  are  summarized  in  the  following  tasks: 

SELECT: Select one of the three conditional microinstructions for
execution as follows:

ACTIVE - If no predicate was evaluated last cycle.

TRUE - If the predicate evaluated last cycle was TRUE.

FALSE - If the predicate evaluated last cycle was FALSE.

INITIATE: Connect the sources to the function units according to the
activated microinstruction and initiate the executions.

EXECUTE:  Perform  the  requested  function  unit  operations  on  the  inputs 

DEPOSIT:  Deposit  the  result  in  the  specified  destination. 

STATUS:  Set  the  condition  codes  according  to  the  outcome  of 
the  execution. 

PREDICATE:  Evaluate  the  predicate  (if  a  branch  can  be  executed). 

ACTIVATE:  Activate  the  conditional  microoperation  stream  selected 
by  the  predicate  evaluation. 

Several  of  these  tasks  do  not  have  counterparts  in  classical 
microprogrammable control units and are given additional
explanation.  The  SELECT  task  is  executed  at  the  end  of  a  control 
cycle  and  it  determines  which  one  of  the  three  composed 
microinstructions  is  to  be  initiated.  The  predicate  evaluation 
is  performed  in  the  Predicate  Evaluator  of  the  Stream  Controller. 


In classical microprogrammable control units, predicate
evaluation  is  undertaken  by  a  branch  microoperation  in  the  last 
microinstruction  of  a  block.  The  adaptive  processor,  however, 
detects  branches  when  a  control  word  of  microoperations  is  passed 
to  the  Stream  Controller.  The  branch  is  executed  as  soon  as  its 
corresponding  condition  code  setting  microoperation  is  executed. 
In  many  situations  this  allows  the  successor  block  to  be 
activated  before  the  current  block  completes  execution.  Thus, 
depending  on  whether  the  currently  active  predicate  is  evaluated, 
and  on  its  value,  the  adaptive  processor  must  select  one  of  the 
three  possible  microinstructions. 
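
The selection rule reduces to a three-way choice on the state of the active predicate. The sketch below is an illustrative encoding only: representing "not evaluated last cycle" as `None` and the function name `select` are assumptions of this example.

```python
def select(predicate):
    """SELECT: choose which composed microinstruction to initiate.

    predicate is None when no predicate was evaluated last cycle,
    otherwise the boolean value of the evaluated predicate.
    """
    if predicate is None:
        return "ACTIVE"
    return "TRUE" if predicate else "FALSE"
```

Calling `select(None)` returns "ACTIVE", while `select(True)` and `select(False)` return "TRUE" and "FALSE" respectively.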

On a conventional central processing unit, the STATUS task is
performed automatically during the execution of machine
instructions that affect the condition codes. On an adaptive
processor, the STATUS task must be made explicit for two reasons.
First, for performance reasons, the condition code setting
microoperation need not be placed just prior to a branch execution
as on conventional microprogrammable central processing units.
Indeed, this microoperation should be executed at its earliest
moment to permit the earliest activation of the selected
successor stream. Second, in the case where several identical
function units are completing simultaneously and only one is to
set the condition codes, the selected codes must be indicated
explicitly.


Most conventional central processing units do not need an
ACTIVATE task because successor blocks are not prefetched. The
360/91 described earlier is an exception because the initial 16
bytes of the alternate successor stream are prefetched. The
adaptive  processor,  as  will  be  described  in  section  3.3.3, 
prefetches  considerably  more.  The  ACTIVATE  task  after  a 
predicate  evaluation  determines  which  of  the  prefetched  streams 
are  to  be  retained.  It  impacts  the  CONDITION,  DIRECT  and  SELECT 
tasks  described  earlier. 

The  structure  of  this  decomposition  of  the  microoperation 
interpretation  process  is  shown  in  figure  3.2.  This  diagram 
shows  the  precedence  between  the  constituent  tasks.  Some  tasks 
appear  in  two  places.  In  the  cases  of  multiple  copies,  the 
instances  with  no  exit  edges  are  initiated  by  tasks  operating  on 
the current microoperation. Those with no entry edges are
initiated  by  tasks  operating  on  a  predecessor  microoperation. 
Also,  some  tasks  are  mutually  exclusive  -  ISSUE  and  DELAY. 
Others  are  conditionally  executed  -  the  task  chain  beginning  with 
STATUS. 

The  decomposition  described  above  is  intended  to  be  neither 
complete,  irreducible,  nor  immutable.  The  description  is 
intended  to  provide  enough  substance  and  detail  on  the 
microoperation  interpretation  process  to  illustrate  its 
complexities  and  to  provide  a  background  for  the  developments  in 
future  sections.  For  example,  it  should  be  noted  that  the 
activities  of  the  disposition  phase  are  performed  on  classical 
microprogrammable central processing units by the microprogrammer
at microcode generation time. In a dynamic environment more
flexibility exists. With appropriate hardware, such as result
buffers, the DESTINATION task could be incorporated into the
DEPOSIT task of the execution phase. If, in addition, result
forwarding  [KEL75]  is  used,  the  SOURCE  task  could  be  merged  with 
the  INITIATE  task,  greatly  reducing  the  complexity  of  the 
disposition  phase.  The  360/91  [TOM67]  uses  these  concepts  in  the 
floating  point  unit. 

In  this  thesis,  our  main  description  of  an  adaptive  processor 
will  be  based  on  the  decomposition  of  the  microoperation 
interpretation  process  as  described  above.  This  is  not  to  imply 
that  this  is  the  best,  most  powerful,  or  simplest  possible 
organization.  It  is  used  as  an  example  because  it  is  a 
conceptually simple extension of conventional microprogrammable
processors.  Other  sections  will  discuss  the  incorporation  of 
source  buffering,  result  buffering,  virtual  function  units,  and 
operand  forwarding. 


Figure 3.2 Adaptive Microprogrammable Task System

3.2 Conditioning Schemes

Before describing the organization of an adaptive processor,
we will discuss and compare several methods that permit, under
certain program conditions, a speed-up of predicate evaluations.
Such methods are necessary to allow the composition of
microinstructions across block boundaries.

On  conventional  machines,  predicate  evaluation  within  a  basic 
block  is  deferred  until  after  all  microoperations  in  the  block 
have  been  initiated.  Thus  the  condition  code  setting 
microoperation  resides  in  the  block’s  tail,  usually  just  before 
the  branch.  An  essential  feature  of  the  adaptive  processor  is 
the  uncoupling  of  the  condition  code  microoperation  from  this 
traditional  position,  permitting  in  many  cases  earlier  branch 
resolution. This uncoupling will be called conditioning. In
this  section  we  will  examine  several  conditioning  strategies  that 
permit  varying  degrees  of  mobility  for  the  condition  code 
microoperation. 

In  conditioning  schemes  where  the  branch  microoperation  is 
used  to  terminate  blocks,  storage  for  the  condition  code  must  be 
provided  until  its  value  is  referenced  by  the  corresponding 
branch. Storage schemes used for temporary data variables are
natural candidates for condition code storage strategies. A
successful  conditioning  strategy  will  have  a  subset  of  the 
following  desirable  properties: 

1. Fast Response - A fast Converger response to control flow
   decisions is desirable. This can be accomplished with a
   multi-predicate evaluation per cycle capability.

2. Simple Controls - Simplicity is desirable because of the low
   cost and high degree of reliability.

3. Transportability - Schemes that least affect the
   transportability of microcode between operationally
   similar adaptive processors are desirable.

4. Control Compatibility - Compatibility with microcode on an
   operationally similar microprogrammable central processing
   unit. The more compatible a conditioning scheme, the
   easier to exploit existing microcode.

3.2.1  Blocked  Conditioning 

Blocked  Conditioning  is  the  simplest  extension  of  the 
traditional  conditioning  used  on  conventional  microprogrammable 
processors.  In  this  scheme,  the  branch  microoperation  terminates 
a  microcode  block.  The  forward  mobility  of  the  condition  code 
microoperation  is  restricted  only  by  its  data  dependencies  within 
the  block,  but  it  must  be  positioned  within  the  block. 

Only  one  condition  code  register  is  required  for  condition 
code  storage.  A  Ready  flag  is  necessary  to  communicate  condition 
code availability to Control. It is set when the condition code
is  generated  and  reset  when  the  predicate  is  evaluated.  To 
identify  a  condition  code  microoperation  an  additional  bit  field 
must  be  provided  in  the  microoperation  format.  This  bit,  when 
detected  in  the  operation  unit,  activates  the  setting  of  the 
condition  code  register  by  the  executing  microoperation. 
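
The storage needed for Blocked Conditioning, one condition code register guarded by a Ready flag, can be modelled in a few lines. The class and method names below are hypothetical, and the predicate is passed in as an ordinary Python function purely for illustration.

```python
class BlockedConditionRegister:
    """Single condition code register with a Ready flag (illustrative model)."""

    def __init__(self):
        self.code = None
        self.ready = False

    def generate(self, code):
        """Called when the flagged microoperation produces its condition code."""
        self.code = code
        self.ready = True      # Ready is set when the code is generated

    def evaluate(self, predicate):
        """Evaluate the block predicate; return None if the code is not yet ready."""
        if not self.ready:
            return None
        self.ready = False     # Ready is reset when the predicate is evaluated
        return predicate(self.code)
```

A branch that arrives before its condition code has been generated simply sees `None` and must be retried in a later cycle; once `generate` has run, a single `evaluate` consumes the code.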

Using  Blocked  Conditioning,  only  the  predicate  of  the  Active 
arm  can  be  evaluated.  Thus  only  two-way  branches  can  be  resolved 
per  cycle.  Nevertheless  an  adaptive  processor  with  Blocked 
Conditioning can in many instances match or exceed the

Transportability of microcode between adaptive processors
with Blocked Conditioning is not affected. In fact, except for
the change in microoperation formats, microcode from
operationally similar microprogrammable control units would be
transportable.


3.2.2 Nested Conditioning

Nested Conditioning is a less restrictive scheme
structures. The sufficient conditions for uncoupling the
condition  code  microoperation  from  its  branch  are  that  (1)  data 
constraints  are  not  violated  and  (2)  only  coupled  condition  code 
microoperation-branch  pairs,  if  any,  occur  in  the  stream  between 
them.  Examples  are  given  in  figure  3.3a. 

The example shows a contiguous section of memory where the
locations of condition code microoperations and branches are
indicated. These points are labelled to indicate their stack
levels: Ci labels the position of a condition code
microoperation whose condition code is placed in the level i
stack cell and Bi is the conditional branch that references the
level i stack cell. An example of nested loops is shown where
the condition code microoperations are contained within the loop
body. The first occurrences of C1 and C2 indicate condition code
microoperations outside the block containing their branches. In
this example, the first C0 corresponds to the forward branch
occurring after the nested loops. The second use of C0 and C1
illustrates a code segment where one of four mutually exclusive
blocks (FF, FT, TF, TT) is to be executed. Nested conditioning
could be exploited because the condition code generated at C0 is
used by both branches B0. Note that only one branch B0 will be
executed.

Simple modifications to this program structure can illustrate
programs that cannot exploit Nested Conditioning. This is shown
in figure 3.3b. Branch B4 gives rise to an abnormal exit (as in
a FORTRAN DO loop), destroying any nesting possibilities for the
branches B0-B3 that are branched around by B4. In these cases,
the mobility range of condition code microoperations under Nested
Conditioning is reduced to that of Blocked Conditioning.
Condition code microoperation mobility is easily thwarted in the
case of forward branching. If, in figure 3.3a, the last
occurrence of B0 is changed to a B2 and does not reference the
condition code generated at C0, the mobility under Nested
Conditioning is again reduced to that of Blocked Conditioning.
Register Conditioning, discussed in the next section, can
overcome these reductions of potential mobility.

Figure 3.3a Memory Layout Of A Program Using Nested Conditioning

Figure 3.3b Memory Layout Of A Program Which Cannot Exploit Nested Conditioning

Although Nested Conditioning is conceptually straightforward,
its implementation is complicated in the adaptive processor
because the stream order and execution order of microoperations
are not constrained to be the same. To guard against potential
execution order transpositions, condition codes are tagged when
the condition code microoperation is passed to the Converger.
The tag identifies the stack cell assigned to the condition code.
A modified LIFO stack memory is needed to store the condition
codes for later reference by the mating branch microoperations.
This stack memory also requires a random access mechanism to
write a tagged condition code into the stack. In addition, each
stack cell requires a status bit to indicate the presence or
absence of a valid condition code.
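
The tagged stack described above can be illustrated with a small
software model. The following Python sketch is illustrative only:
the class and method names (ConditionCodeStack, allocate, write,
pop_for_branch) are hypothetical and are not taken from Appendix A,
and stack overflow is simply reported as an error rather than
managed by hardware.

```python
class ConditionCodeStack:
    """Software model of a LIFO condition code stack with tagged writes."""

    def __init__(self, depth):
        self.depth = depth      # number of stack cells (assumed fixed)
        self.cells = []         # each cell is [status_bit, condition_code]

    def allocate(self):
        """A condition code microoperation enters the Converger.

        Reserves the next stack cell and returns its tag (the cell index).
        """
        if len(self.cells) == self.depth:
            raise OverflowError("condition code stack overflow")
        self.cells.append([False, None])
        return len(self.cells) - 1

    def write(self, tag, condition_code):
        """Random-access write when the tagged microoperation completes."""
        self.cells[tag] = [True, condition_code]

    def pop_for_branch(self):
        """The mating branch references the top cell.

        Returns the condition code, or None if the status bit shows that
        no valid code is present yet (the branch must wait).
        """
        valid, code = self.cells[-1]
        if not valid:
            return None
        self.cells.pop()
        return code


stack = ConditionCodeStack(depth=4)
outer = stack.allocate()            # stream order: outer code first
inner = stack.allocate()
stack.write(outer, "NEG")           # execution order is transposed
waiting = stack.pop_for_branch()    # inner code not yet valid -> None
stack.write(inner, "ZERO")
first = stack.pop_for_branch()      # "ZERO"
second = stack.pop_for_branch()     # "NEG"
```

The tag lets a condition code that completes out of stream order
land in its own cell, while the status bit keeps a branch from
reading a cell that has been reserved but not yet written.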

Appendix  A  examines  the  implementation  of  a  Nested 
Conditioning  scheme.  This  appendix  describes  the  system  blocks 
and  the  actions  taken  when  a  condition  code  microoperation  or 
branch  is  detected  and  when  a  condition  code  is  generated.  The 
scheme can resolve four predicates/cycle if both the condition
code  and  branch  for  each  predicate  have  been  detected. 

For certain program structures, Nested Conditioning can
improve performance over Blocked Conditioning. However, the cost
is much greater hardware complexity. In addition, to exploit its
advantages, the microcode must be preprocessed to determine
condition code microoperation mobility. As with Blocked
Conditioning, an explicit bit field is needed in the
microoperation to indicate which microoperations are to generate
condition codes. Note that unless stack overflow is
automatically managed by hardware, the microcode is dependent on
the number of stack cells, potentially reducing its
transportability.

3.2.3 Register Conditioning

Register Conditioning is a generalization of Nested
Conditioning. In this scheme, a set of addressable registers is
provided for condition code storage and the condition code
microoperation explicitly specifies which register is assigned to
its condition code. Because the nesting of condition code
microoperations and branches is not required, condition code
microoperation mobility potential is increased. Figure 3.3c
shows the memory layout of a program segment which can exploit a
Register Conditioning scheme with at least 5 condition code
registers.

The additional mobility is not without its drawbacks.
Redundant executions of condition code microoperations become
common. In figure 3.3c, if the branch B4 transfers control to
its target address, condition code microoperations C0-C3 have
been needlessly executed. Also, for the second occurrences of C0
and C2, either the condition code microoperation at C0 or the one
at C2 is redundantly executed because their mating branches B0
and B2 are located on mutually exclusive program paths. This
scheme also inherits the register saving and restoring chores
required by general purpose register organizations for subroutine
and interrupt execution. Finally, microoperation formats now
require an additional register field to specify the condition
code register.

To fully exploit Register Conditioning, microcode generation
requires routines to prudently position condition code
microoperations and assign the condition codes to registers.
This is a form of the register allocation problem. Because
condition code reference frequencies are expected to be an order
of magnitude lower than for the corresponding general purpose
register case, computational difficulties in assigning condition
code registers are expected to be rare. The need for explicit
condition code register assignments and the potential need for
saving and restoring their contents would be a major source of
microcode incompatibility between adaptive processors having
different forms of Conditioning. Even Register Conditioning
schemes with different numbers of condition code registers would
not be completely compatible.
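
To make the register allocation view concrete, the sketch below
assigns condition code registers along a single straight-line
stream. It is only an illustration under stated assumptions: each
condition code is described by a hypothetical pair (position of
its condition code microoperation, position of its mating branch),
a register is considered free once the mating branch has been
passed, and the simple interval scan used here is our choice, not
a method prescribed in the text. Mutually exclusive paths are not
modelled.

```python
def assign_condition_registers(intervals, num_registers):
    """Assign a condition code register to each (cc_pos, branch_pos) pair.

    intervals     -- list of (cc_pos, branch_pos) stream positions
    num_registers -- number of condition code registers available
    Returns a list of register numbers in the same order as `intervals`.
    Raises ValueError if the registers do not suffice.
    """
    order = sorted(range(len(intervals)), key=lambda k: intervals[k][0])
    free = list(range(num_registers))   # registers not currently holding a code
    active = []                         # (branch_pos, register) pairs in use
    assignment = [None] * len(intervals)
    for k in order:
        cc_pos, branch_pos = intervals[k]
        # Release registers whose mating branch precedes this microoperation.
        still_active = []
        for used_until, reg in active:
            if used_until < cc_pos:
                free.append(reg)
            else:
                still_active.append((used_until, reg))
        active = still_active
        if not free:
            raise ValueError("not enough condition code registers")
        free.sort()
        reg = free.pop(0)               # lowest-numbered free register
        assignment[k] = reg
        active.append((branch_pos, reg))
    return assignment


# Three condition codes; the second is nested inside the first.
registers = assign_condition_registers([(0, 5), (1, 3), (4, 8)], num_registers=2)
```

Running the same intervals with num_registers=1 raises an error,
which mirrors the remark that microcode written for one number of
condition code registers need not fit a machine with fewer.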

3.2.4 Advanced Conditioning

As can be inferred from the preceding descriptions of
Conditioning schemes, a major source of complexity arises in the
need to recouple the condition code to its mated branch. This
complexity is sharply reduced in Advanced Conditioning schemes.
Advanced Conditioning is a term that encompasses a set of
conditioning schemes comparable to those already discussed, but
having a basic difference. In Advanced Conditioning, the
conditional branch is decomposed into two separate
microoperations, the Predicate microoperation and the Marker
microoperation. The Predicate microoperation depends only on the
condition code and is advanced, in tandem with the condition
code microoperation, in the generated microcode. The Marker
microoperation, containing information on the branch target
address, is placed in the traditional delimiting position, at the
end of its block.

The  Predicate  microoperation  contains  the  information  for 
evaluating  a  predicate  based  on  the  outcome  of  its  mated 
condition  code  microoperation,  as  well  as  specifying  a  Predicate 
Register  assigned  to  it.  The  assignment  can  be  implicit  as  with 
Blocked or Nested Conditioning or explicit as with Register
Conditioning.  The  Predicate  microoperation  would  be  placed 
immediately  before  its  condition  code  microoperation  in  the 
microcode.  This  positioning  obviates  the  need  for  explicitly 
marking  a  condition  code  microoperation  because  this  information 
can  be  obtained  by  context,  as  in  traditional  schemes.  The 
Predicate  microoperation  is  placed  before  the  condition  code 
microoperation to minimize the nuisance of detecting a condition
code  microoperation  when  it  is  separated  from  the  Predicate 
microoperation  by  a  physical  memory  block  boundary  or  by  the 
control  cycle  boundary. 

The  Marker  microoperation  specifies  the  branch  target  address 
and  the  Predicate  Register  whose  value  will  determine  the 
successor  block.  When  it  is  detected  in  the  Converger,  a  FORK 
operation  is  invoked  and  when  its  corresponding  predicate  value 
is  generated,  a  PROMOTE  on  the  selected  arm  could  be  initiated. 
The  Marker  microoperation  is  thus  an  unconditional  branch 
specifying  an  additional  data  input,  the  Predicate  Register. 
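
The decomposition can be sketched as a simple rewriting of one
block. In the Python illustration below, the dictionary fields
("op", "test", "target", "preg") and the function name
decompose_branch are hypothetical conventions chosen for the
example; only the placement rules (Predicate immediately before
its condition code microoperation, Marker at the end of the block)
come from the description above.

```python
def decompose_branch(block, cc_index, predicate_register):
    """Split the conditional branch ending `block` into Predicate + Marker.

    block              -- list of microoperation dicts; the last one is
                          {"op": "BRANCH", "test": ..., "target": ...}
    cc_index           -- index of the condition code microoperation
    predicate_register -- Predicate Register shared by both halves
    """
    *body, branch = block
    if branch["op"] != "BRANCH":
        raise ValueError("block must end in a conditional branch")
    predicate = {"op": "PREDICATE", "test": branch["test"],
                 "preg": predicate_register}
    marker = {"op": "MARKER", "target": branch["target"],
              "preg": predicate_register}
    # Predicate goes immediately before its condition code microoperation;
    # Marker keeps the delimiting position at the end of the block.
    return body[:cc_index] + [predicate] + body[cc_index:] + [marker]


block = [{"op": "SUB"}, {"op": "ADD"},
         {"op": "BRANCH", "test": "NEG", "target": 40}]
rewritten = decompose_branch(block, cc_index=0, predicate_register=1)
ops = [m["op"] for m in rewritten]   # PREDICATE, SUB, ADD, MARKER
```

Both halves name the same Predicate Register, which is the only
link that remains between the early predicate evaluation and the
later choice of successor block.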

This  decomposition  of  the  conditional  branch  microoperation 
offers  three  significant  advantages  over  the  schemes  described 
earlier.  First,  the  microoperation  format  is  simplified  because 
no  appended  field  is  necessary  to  distinguish  condition  code 
microoperations.  Second,  by  incorporating  predicate  resolution 
logic  within  function  units,  the  predicate  value  can  be  generated 
concurrently  with  the  result.  Consequently,  resolving  a  branch 
successor  can  be  accomplished  earlier  than  in  the  previous 
schemes  which  sequentially  generate  a  condition  code  and  then 
evaluate  the  predicate.  For  example,  if  a  hardware  multiplier 
were  available  and  a  branch  depended  on  the  sign  of  a  product, 
the  predicate  value  could  be  made  available  within  a  cycle  of  the 
multiply  initiation.  Classical  schemes  wait  until  the  multiply 
terminates  before  initiating  predicate  resolution.  This 
difference  can  be  several  cycles,  depending  on  the  speed  of  the 
multiplier.  The  third  advantage  is  a  reduction  in  information 
transferred from the operation unit to the control unit. Instead
of transferring several bits of condition code information, one
bit suffices for the predicate value. A Ready bit is still
necessary, however, to synchronize access to the Predicate
Register. This last advantage makes a scheme that can
simultaneously examine a sequence of predicate values more
tractable. The cost is increased complexity in the function
units, a cost that LSI and VLSI promise to decimate.
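
The multiplier example can be made concrete. The text does not
say how the predicate resolution logic would be built; the Python
sketch below assumes one plausible approach for the predicate
"the product is negative", namely deriving it from the operand
signs at multiply initiation instead of waiting for the product.
The function name is hypothetical.

```python
def product_is_negative(a, b):
    """Evaluate the predicate 'a * b < 0' from the operands alone.

    The product is negative exactly when the operand signs differ and
    neither operand is zero, so no multiplication is needed.
    """
    return (a < 0) != (b < 0) and a != 0 and b != 0


# The early predicate agrees with the predicate taken from the product.
agrees = all(product_is_negative(a, b) == (a * b < 0)
             for a in range(-3, 4) for b in range(-3, 4))
```

Only this single predicate bit, together with a Ready bit, would
then need to reach the control unit.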

Generating microcode for adaptive processors with Advanced
Conditioning is analogous to generating microcode for adaptive
processors with Blocked, Nested, or Register Conditioning,
depending on the convention used to specify Predicate Registers.
One carefully moves the condition code microoperation and
Predicate microoperation as far forward as data and conditioning
restrictions allow. (This is discussed in section 3.9.) The
microcode of an operationally similar microprogrammable control
unit can be adapted in a straightforward manner by decomposing
all conditional branch microoperations, positioning the Predicate
and Marker microoperations as described above. Apart from this,
microcode compatibility or transportability considerations for
Advanced Conditioning schemes are the same as for their
previously discussed counterparts.

3.2.5 Comparison of the Conditioning Schemes

The characteristics of the conditioning schemes are
qualitatively summarized in table 3.1. A value of 1 for a
characteristic corresponds to the best possible while a 5
corresponds to the worst possible. These are relative
comparisons only within rows. Thus column sums, corresponding to
conditioning schemes, do not provide a valid comparison.

Branch  resolution  speed  is  fastest  for  Register  Conditioning 
schemes  because  the  condition  code  microoperation  has  the  largest 
mobility range. Within the conditioning schemes, Advanced
schemes have a further advantage because predicate evaluation is
faster.

The  complexity  of  the  logic  to  implement  the  conditioning 
schemes is greatest for the Nested schemes because dynamic
register assignment and stack overflow handling logic must be
provided.  Advanced  Conditioning  control  logic  is  simpler,  within 
the  conditioning  schemes,  because  the  routing  of  condition  codes 
to  a  predicate  evaluation  unit  is  not  necessary. 

Condition  code  microoperation  placement  is  most  difficult  for 
Register  Conditioning  schemes  because  of  the  greater  positioning 
opportunities  and  the  need  to  explicitly  assign  condition  code  or 
Predicate  Registers.  Registers  are  assigned  implicitly  in 
Blocked  and  Nested  schemes. 

Modularizability is the ease with which modules, such as
subprograms, may be used. It also reflects interruptibility, the
ease with which interrupts can be serviced. Register
Conditioning may require that certain registers be saved when a
microcode module is invoked during execution, and later restored
when the module has completed execution. This must be done
explicitly. In Nested Conditioning, this is performed
automatically by the condition code stacking hardware. In
Blocked Conditioning, this could also be done by a hardware
stack.

Compatibility  with  traditional  microcode  is  least  for 
Register  Conditioning  schemes,  again  because  of  the  need  for 
explicitly  storing-restoring  predicate  registers  and  because 
condition code microoperation placements are more complex. Also,
the  requirement  of  two  microoperations  for  conditional  branches 
in  Advanced  Conditioning  detracts  from  compatibility. 

Register Conditioning executes the most redundant
microoperations, both because some condition code microoperations
are redundantly executed, depending on their position, and
because of the store-restore microoperations, which are not as
extensively required for Nested or Blocked Conditioning.

Control  memory  usage  is  greatest  for  Register  Conditioning 
because  each  microoperation  requires  a  field  to  specify  a 
condition  code  register.  The  other  non-Advanced  schemes  require 
only  an  additional  bit  per  microoperation.  Advanced  Conditioning 
does  not  require  an  additional  field  per  microoperation  but  does 
require  an  additional  microoperation  for  each  conditional  branch. 
It  is  expected  that  the  Predicate  and  Marker  microoperations 
could  accommodate  a  Predicate  Register  specification  without 
exceeding  a  standard  microoperation  width. 

3.3 Adaptive Processor Implementation

A  system  level  implementation  of  the  adaptive  processor  is 
described.  This  implementation  is  heavily  influenced  by  the 
microoperation  interpretation  structure  described  in  section 
3.1.1.  Because  this  structure  models  the  microprogram  generation 
process,  the  implementation  is  closely  related  to  the  classical 
microprogram  control  unit  in  an  evolutionary  sense.  The  adaptive 
processor  system  blocks  and  their  interrelationships  are  shown  in 
figure  3.4.  These  interrelationships  will  be  briefly  outlined 
here,  while  the  blocks  will  be  discussed  in  following  sections. 

The activities of the adaptive processor are started from the
NEXT MICROPROGRAM input. This input provides the initial address
of a microprogram to be executed. Parameters for the
microprogram are passed according to some convention. This will
not be discussed here, but see Husson [HDS70] for some methods of
how this is done. The FETCH NEXT MICROPROGRAM output initiates a
main memory fetch for the next microprogram when the adaptive
processor is ready to accept it. After a microprogram has been
accepted, Fetch Control begins transmitting microoperations to
the Buffer Decoder. The primary source of microoperations is the
Cache; the secondary source is Control Memory. Control Memory
transfers words to Cache, which can supply microoperations at a
higher rate to the Buffer Decoder. The Buffer Decoder accepts
microoperations in parallel and then serializes the
microoperations into a stream that is passed to the Converger.
As the stream is serialized, the microoperations are decoded and
converted to a standardized format that is convenient to the
blocks further down. Stream control microoperations are detected
here and this information is passed to Fetch Control to indicate
block terminations. Fetch Control then reacts accordingly by
determining the successor blocks and arranging for their transfer
from either Cache or Control Memory to the Buffer Decoder. The
Converger buffers the ACTIVE block and its successors,
transmitting the appropriate microoperation streams to the
Issuer. The Issuer has the capability to examine a number, SR,
of microoperations each cycle for possible issue to the Control
Buffer for execution initiation in the next cycle. SR is called
the scan rate of the Issuer and is an important parameter of the
adaptive processor. If a microoperation cannot be issued, the
microoperation is delayed and returned to a delay queue in the
Converger. From the Control Buffer, one of the three composed
microinstructions is initiated for execution at the beginning of
the next control cycle. The initiated microoperations are
executed in the operation unit, which contains data memory, work
registers and function units. If a condition code setting
microoperation completes execution, the status information is
transmitted to the Predicate Evaluator, which proceeds to evaluate
the predicate specified by the active conditional branch. Note
that branches are not executed in the operation unit but in the
Stream Controller. The predicate value is transmitted to the
system blocks shown in the diagram, where the system blocks adjust
their activities accordingly.

It  is  evident  that  the  path  length  of  a  microoperation  from 
Control  Memory  to  the  Control  Buffer  has  been  greatly  increased 
in  the  adaptive  processor.  However,  response  is  not  degraded 
because the feedback path of the operation unit status to the
control unit is no longer than on classical machines. As long as
the Converger can maintain a continuous stream of microoperations
to the Issuer and the Issuer scan rate, SR, is sufficiently
large,  adaptive  processor  performance  will  compare  favourably 
with  the  performance  of  classical  microprogram  control  units. 

To provide some indication of the relative timing of events,
we briefly describe the timing within the execution cycle or scan
cycle. These terms are synonymous but will be used in their
respective contexts: the execution of microoperations in the
operation unit or the scanning of microoperations by the
Issuer. In addition, control cycle may also be used to refer to
these cycles. An execution or scan cycle is subdivided into a
number of equal phases. The first and last phases, denoted by R
and W, respectively, have special significance for the execution
of microoperations. During the R, or read, phase, source values
are gated from work registers to the function units. During the
W, or write, phase, result values are gated into the destination
work registers. The execution of microoperations proceeds as on
conventional microprogrammable control units. A microinstruction
is initiated in the first phase, R, of each execution cycle. The
execution of the microoperation then proceeds and may extend over
several execution cycles. After a result is generated by a
function unit, it is usually written or deposited into a work
register. This occurs in the last phase, W, of the execution
cycle.

A microinstruction is composed by the Issuer concurrently
with  the  execution  of  the  previously  composed  microinstruction 
during  what  is  called  a  scan  cycle.  The  scan  cycle  is 
superimposed  on  the  execution  cycle  and  a  microoperation  is 
examined  for  issue  in  each  phase  of  the  scan  cycle.  An  exception 
may  be  the  last  phase  which  coincides  with  the  W  phase  of  the 
control  cycle.  The  depositing  of  result  values  in  the  W  phase 
can  interfere  with  the  verification  of  issue  conditions. 

The  system  block  descriptions  will  include  their  function, 
various  hardware  structures  that  can  be  used  for  their 
implementation,  and  performance  parameters  that  affect  the 
microoperation  flow  rate  through  the  system  blocks. 

The performance parameters will be related to the speed
objectives of the adaptive processor. A design goal is the
capability to initiate the same number of microoperations/cycle
as an operationally equivalent maximal microprogrammable central
processing  unit.  As  a  basis  for  timing  comparisons,  the  ECL 
logic  in  the  Motorola  M10000  series  will  be  used.  One  example  of 
a  central  processing  unit  using  the  Motorola  M10800  LSI  family 
has  a  typical  cycle  time  of  61.5  nsec  [BLO].  With  a  typical  ECL 
gate  delay  of  2  nsec,  this  cycle  time  is  equivalent  to  31  gate 
delays.  This  figure  will  be  used  as  a  measure  of  the 
microoperation  throughput  of  a  given  block. 

Using  the  gate  delay  as  our  unit  of  time,  we  note  a 
fundamental  speed  limitation  on  information  flow  through  logic 
nets.  If  a  signal  is  used  to  set  a  latch,  then  its  minimum  width 
must be two gate delays. [...] For a 31 gate delay cycle,
this speed limitation imposes an effective
upper  bound  on  the  Issuer  serial  scan  rate  of  about  10 
microoperations/cycle.  The  scan  rate,  however,  can  be  increased 
by  examining  microoperations  in  parallel  during  each  phase.  The 
cost  is  additional  information  lines  and  circuitry.  The  Issuer 
organization  to  be  described  here  examines  only  one 
microoperation  in  each  phase.  Although  more  exacting  performance 
specifications  may  require  a  multi-microoperation/phase 
organization,  the  essence  and  complexity  of  adaptive  processor 
operation  are  sufficiently  conveyed  by  the  simpler  version. 


In  the  following  sections,  fairly  liberal  use  of  hardware  is 
suggested  to  meet  desired  performance  levels.  In  particular, 
parallelism  by  replication  of  various  hardware  components  is 
suggested.  Such  design  strategies  in  the  past  have  been  in  the 
exclusive  domain  of  super  computer  designs.  Currently,  many 
minicomputers  take  advantage  of  cache  memories,  pipelining,  and 
virtual  storage  organizations  to  improve  either  their  performance 
or their usability. This appears to be a continuing trend,
upgrading computer design with increasingly complex hardware.
Our use of hardware structures should be viewed in the light of
the predictions by Noyce [NOY76] and Shepherd [SHE77] on VLSI
capabilities in the mid-80's. They predict a computer on a chip
with  a  megabit  of  memory.  Such  awesome  capabilities  will 
mitigate  the  hardware  excesses  of  the  adaptive  processor. 

3.3.1  Control  Memory 

Control  Memory  can  supply  a  word  of  microoperations  each 
cycle  to  the  Stream  Controller.  It  is  controlled  by  Fetch 
Control which determines the address of the microoperations to be
fetched. The flow rate of microoperations from Control Memory
can be increased by increasing the block or access width. As ECL
cycle times for memory chips of 10 gate delays are possible,
Control Memory can be easily accessed within a control cycle of
31  gate  delays.  Thus  Control  Memory  should  be  capable  of 
delivering  a  suitable  flow  rate. 

3.3.2  The  Stream  Controller 

The  Stream  Controller  implements  the  major  portion  of  the 
provision  phase,  its  function  being  to  provide  a  continuous 
stream  of  microoperations  to  the  Issuer.  This  function  requires 
the  smoothing  of  interruptions  to  the  stream  flow.  We  will 
describe  some  possible  interruptions  and  the  extent  to  which  they 
are  overcome.  Control,  consisting  of  Fetch  Control  and  Predicate 
Evaluator,  provides  the  directives  to  guide  microoperations 
through the Cache, Buffer Decoder and Converger to the Issuer.
Control  regulates  according  to  the  control  subpolicies  of  the 
provision  phase. 

The  Stream  Controller  concerns  itself  only  with 
microoperation stream flow. Interruptions due to data arrivals
from  data  memory  are  not  considered  here.  The  interruptions  we 
consider  occur  at  block  and  microprogram  interfaces.  The  former 
interruptions  occur  because  of  the  decision  time  required  to 
determine  the  successor  of  the  current  block.  The  latter  occur 
in  effecting  the  transition  to  the  successor  microprogram.  In 
this  respect  we  assume  that  some  higher  level  stream  controller 
provides  a  continuous  stream  of  microprogram  requests  to  the 
adaptive  processor.  Consequently,  the  Stream  Controller 
considers  only  the  effects  of  stream  control  microoperations 
causing  conditional  and  unconditional  block  transitions  and  the 
effect  of  microprogram  invocations. 

Microprogram  invocations  are  handled  by  Fetch  Control.  The 
end of a microprogram is demarked by an End-of-Microprogram
microoperation. Its detection at any of several points within
the Stream Controller can be used to prefetch the next
microprogram. The detection of the End-of-Microprogram
microoperation  in  the  Buffer  Decoder  could  initiate  the 
transmission  of  the  next  microprogram  into  the  Converger. 
Consequently,  transition  of  control  to  a  new  microprogram  will 
proceed  smoothly  after  the  starting  address  has  been  determined. 
The  initial  block  of  the  prefetched  microprogram  is  concatenated 
to  the  terminal  block  of  the  currently  executing  microprogram. 

The execution of an unconditional branch microoperation
produces a similar effect. The target address can be computed by
Fetch Control upon detection of the unconditional branch and the
fetching of the target block can begin in the following cycle.
The target block microoperations are concatenated to the
predecessor microoperations. Again, the transition due to an
unconditional branch microoperation can be effected smoothly
after its detection. As with the End-of-Microprogram
microoperation, unconditional branch microoperations can be
detected when a control word is transmitted to the Cache or
Buffer Decoder.

Conditional  branch  microoperations  pose  a  more  severe 
problem.  Although  the  branch  target  address  can  be  computed  as 
with  unconditional  branches,  the  successor  block  cannot  be 
determined  until  the  predicate  specified  by  the  conditional 
branch  is  evaluated.  To  guarantee  immediate  access  to  the 
selected  successor  block,  the  initial  microcode  segments  of  both 
successors  must  be  prefetched.  The  mechanics  and  the  extent  of 
the  necessary  prefetch  are  discussed  in  the  description  of  the 
Converger. The degree of disruption due to the wait for a
predicate value will depend both on the availability of successor
microoperations and on the wait for the generation of the
condition code required for the predicate. The wait for the
condition code is a function of the stream distance the condition
code microoperation can be percolated ahead of its branch and of
its execution time. This has been discussed in section 3.2
on Conditioning. The Predicate Evaluator contained in Control
performs the predicate computation specified by the branch when
the condition code generated in the operation unit is passed to
it. The predicate value is then transmitted to Fetch Control,
Buffer Decoder, Converger and Issuer, where necessary stream
routing action is undertaken.

Summarizing,  we  can  see  that  the  effectiveness  of  the  Stream 
Controller  depends  on  its  ability  to  solve  stream  routing 
problems as early as possible after detection. The transitions to
the  next  microprogram  and  to  successor  blocks  through 
unconditional  branches  pose  no  serious  problems.  On  the  other 
hand,  the  resolution  of  conditional  branches  is  program 
dependent.  The  Stream  Controller  must  prefetch  at  least  the 
initial  successor  segments  and  resolve  the  predicate  as  early  as 
possible. 
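
The three kinds of block termination summarized above can be
sketched as a single decision. The Python function below is a
hypothetical illustration: the terminator encoding, the function
name and the assumption that the F successor of a conditional
branch is the fall-through block are ours, not part of the
described hardware.

```python
def blocks_to_prefetch(terminator, fall_through, next_microprogram=None):
    """Return the block addresses Fetch Control should prefetch.

    terminator        -- dict with a "kind" and, for branches, a "target"
    fall_through      -- address of the block following the terminator
    next_microprogram -- start address supplied by NEXT MICROPROGRAM
    """
    kind = terminator["kind"]
    if kind == "END_OF_MICROPROGRAM":
        # Concatenate the initial block of the next microprogram.
        return [next_microprogram]
    if kind == "UNCONDITIONAL":
        # Only one successor: concatenate the target block.
        return [terminator["target"]]
    if kind == "CONDITIONAL":
        # Successor unknown until the predicate is resolved:
        # prefetch the initial segments of both arms (F, then T).
        return [fall_through, terminator["target"]]
    raise ValueError("not a stream control microoperation: " + kind)


both_arms = blocks_to_prefetch({"kind": "CONDITIONAL", "target": 40},
                               fall_through=12)
```

Only the conditional case depends on the program: the list it
returns is held until the Predicate Evaluator selects one arm.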

3.3.3  The  Converger 

To  maintain  a  continuity  in  stream  flow  to  the  Issuer,  the 
Stream  Controller  must  have  the  ability  to  look  ahead  in  the 
microprogram  to  plan  for  the  possible  eventualities  that  might 
occur during the microprogram execution. It does so in the
Converger,  linking  together  microoperations  into  stream 
tributaries,  forming  a  dynamic  program  structure.  This  program 
structure  is  basically  a  binary  tree  whose  nodes  are  straight 
line  sequences  of  microoperations  that  are  terminated  by  (two 
way)  conditional  branches.  Figure  3.5  illustrates  such  a 
structure  and  some  useful  terminology  associated  with  it. 

Nodes, consisting of blocks of microoperations, will be
called arms. The arms reside on a level. The level 0 arm, or
root, is called the active arm and contains active
microoperations that must eventually be executed. Arms in a
higher  numbered  level  are  called  conditional  arms.  The  execution 
of  microoperations  in  these  arms  is  conditional  on  the  predicate 
values  of  predecessor  branches.  The  level  number  is  the  number 
of  binary  predicates  that  must  be  evaluated  before 
microoperations  in  that  level  can  become  active.  The  labels 
attached  to  these  arms  indicate  the  chain  of  values  that  the 
preceding  predicates  must  attain  for  that  arm  to  be  activated. 


Figure 3.5 Program Structure

To dynamically maintain the program structure, Control uses
the three high level operations defined below:

CREATE:

The active arm is created. CREATE is invoked when the
Converger is empty and a block of microoperations is to
be passed to the Issuer.

FORK(X1 X2 ... Xi):

This operation is invoked when a conditional branch
enters node X1 X2 ... Xi (Xj = F or T, 0 < j <= i). This
completes the construction of arm X1 X2 ... Xi and
initializes arms X1 X2 ... Xi F and X1 X2 ... Xi T.

PROMOTE(X1 X2 ... Xi) (Xj = T or F, 1 <= j <= i):

This operation is invoked when the predicate for node
X1 X2 ... X(i-1) is evaluated and attains value Xi. This
merges the microoperations in X1 X2 ... X(i-1) and
X1 X2 ... Xi, promoting the subtree rooted by
X1 X2 ... Xi one level. The effect is to relabel all
promoted nodes by removing Xi and to delete the subtree
rooted by X1 X2 ... X(i-1) Xi', where Xi' denotes the
complement of Xi.

Note that in an adaptive processor, the promotion of a node
at any level is possible. This is a consequence of unbinding the
condition code setting microoperation from the branch, allowing
the microoperation to 'percolate' to its highest level and
advance its execution initiation. In conventional control units,
promotion is limited to only the T or F arms.
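
The three operations can be modelled in a few lines of software.
The Python sketch below is a hypothetical model, not the hardware
organization: arms are stored in a dictionary keyed by their label
(a string of 'F' and 'T' characters, the empty string being the
active arm), and the names Converger, create, fork, promote and
add are chosen for the example.

```python
class Converger:
    """Software model of the program structure maintained by Control."""

    def __init__(self):
        self.arms = {}      # label -> list of microoperations

    def create(self):
        if self.arms:
            raise RuntimeError("CREATE requires an empty Converger")
        self.arms[""] = []  # the active (level 0) arm

    def add(self, label, microoperation):
        self.arms[label].append(microoperation)

    def fork(self, label):
        """A conditional branch entered arm `label`: open both successors."""
        if label not in self.arms:
            raise KeyError(label)
        self.arms[label + "F"] = []
        self.arms[label + "T"] = []

    def promote(self, label):
        """The predicate of arm label[:-1] attained the value label[-1]."""
        parent, value = label[:-1], label[-1]
        rejected = parent + ("F" if value == "T" else "T")
        promoted = {}
        for name, ops in self.arms.items():
            if name.startswith(rejected) or name == label:
                continue    # rejected subtree is deleted; label merges below
            if name.startswith(label):
                # Descendant of the selected arm: drop Xi from its label.
                promoted[parent + name[len(label):]] = ops
            else:
                promoted[name] = ops
        promoted[parent] = self.arms[parent] + self.arms[label]
        self.arms = promoted


c = Converger()
c.create()
c.add("", "a")
c.fork("")
c.add("F", "c")
c.add("T", "b")
c.fork("T")
c.add("TF", "d")
c.add("TT", "e")
c.promote("TF")             # level 2 resolved before the root predicate
after_inner = dict(c.arms)  # {"": ["a"], "F": ["c"], "T": ["b", "d"]}
c.promote("T")
after_root = dict(c.arms)   # {"": ["a", "b", "d"]}
```

The first PROMOTE in the usage example acts on a level 2 arm while
the level 0 predicate is still pending, which is the case singled
out in the note above.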


A question that naturally arises is 'How many levels should
be provided in the Converger?'. We offer a rudimentary
cost/benefit analysis to provide some insight. Clearly, the cost of
memory for storing the microoperations and of circuitry to provide
the control and interconnections can increase exponentially with
the number of levels. And, in the worst case, to maintain the
supply of microoperations to the arms in the i'th level, the flow
rate from Cache and Control Memory to the Converger must be $2^i$
times the flow rate through the active arm.
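
The exponential growth can be made concrete with a short calculation. The sketch assumes a full binary tree of arms (every arm at every level ends in a conditional branch), which is the worst case; the function names are ours.

```python
def arms_at_level(i: int) -> int:
    # A full binary tree holds 2**i arms at level i (level 0 is the active arm).
    return 2 ** i


def total_arms(levels: int) -> int:
    # Levels 0 .. levels-1 together hold 2**levels - 1 arms.
    return sum(arms_at_level(i) for i in range(levels))
```

For example, four levels (0 through 3) require storage and interconnection for 15 arms, eight of them at the highest level.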


The benefit of providing i+1 levels is that it allows the
adaptive processor to maintain performance equal to conventional
microprogram control units with a $2^i$-way branch capability in the
face of a $2^i$-way branch situation. Such a worst case situation
occurs  when  all  arms  at  level  i-1  or  lower  contain  only  a  pending 
conditional  branch.  If  all  the  predicates  in  a  chain  of  i-1 
branches  are  simultaneously  resolved,  then  the  microoperations  in 
the  selected  arm  can  be  passed  to  the  Issuer  in  the  following 
cycle.  The  stream  would  not  suffer  due  to  the  disruption  of 
branch  resolutions.  If,  on  the  other  hand,  only  i-1  levels  were 
maintained,  in  this  case,  branch  resolution  would  occur  as 
before.  However,  there  would  be  no  microoperations  available  for 
the  Issuer  in  the  next  cycle  because  no  level  i  arm  was 
maintained.  In  this  case,  a  fetch  to  Cache  or  Control  Memory 
must  be  initiated  after  branch  resolution  and  this  access  delay 
would  contribute  directly  to  the  execution  time  of  the 
microprogram.  This  delay  would  be  directly  attributable  to  the 
described  hardware  shortcoming  of  the  Converger. 

This  multiway  branch  inadequacy  of  adaptive  processors  can  be 
examined  from  a  more  pragmatic  perspective.  In  practice  [IIUS70], 
this  capability  is  used  mostly  to  decode  the  machine  instruction 
opcodes  and  consequently  its  occurrence  frequency  is  relatively 
high  [BBL73].  Opcodes  can  also  be  decoded  using  table  look-up  to 
a  dedicated  ROM,  an  activity  that  can  be  easily  overlapped  with 
the  execution  of  the  previous  instruction  on  an  adaptive 
processor.  Consequently,  this  high  frequency  usage  of  multiway 
branches  is  eliminated.  In  section  4.4  a  quantitative  analysis 
on  Converger  dynamics  is  given. 
3.3.4  The Window

An  important  subcomponent  of  the  Converger  is  the  Window. 
The  Window  contains  the  actual  microoperations  that  can  be 
transmitted to the Issuer. It consists of the lower two levels
of the Converger - the A, T and F arms. Its chief
responsibilities, in addition to those assigned to the Converger,
are to

1.  Provide successive microoperations to the Issuer for
issuing to the Control Buffers,

2.  Retain and reposition delayed microoperations for
re-examination in the next scan cycle.

To  provide  for  the  storage  of  microoperations,  a  queue  set  is 
allocated  to  each  of  the  considered  microoperation  arms  A,  T,  and 
F. Each queue set consists of three FIFO queues: an INITIAL
QUEUE (IQ), a DELAY QUEUE (DQ), and a DELAY BUFFER (DB). Figure
3.6 shows the possible flow paths of a microoperation through the
Window. The IQ is used to store incoming microoperations from
the Converger arms above level 1. The DQ stores those
microoperations  that  have  been  rejected  for  issue  in  the  current 
cycle.  Because  the  Issuer  requires  n  scan  phases  (n  is  the 
number  of  scan  cycle  phases  required  to  check  the  issue 
conditions  of  a  microoperation  and  is  dependent  on  the  Issuer 
implementation)  to  determine  whether  or  not  a  microoperation  is 
issued,  a  DELAY  BUFFER  of  n  cells  must  be  used  to  retain  a 
transmitted  microoperation  until  a  decision  is  reached.  If  the 
microoperation  is  delayed  it  is  transmitted  to  the  DQ;  otherwise 
it  is  deleted  from  the  DB. 


Figure 3.6  Window Data Paths

Microoperations are transmitted to the Issuer each scan cycle
in the following stream order:

1.  ACTIVE

2.  TRUE, FALSE.

ACTIVE  microoperations  are  sent  to  all  Issuer  subparts  until  no 
ACTIVE  microoperations  remain  or  until  the  scan  cycle  completes. 
After  all  ACTIVE  microoperations  have  been  transmitted,  the  T  and 
F  microoperations  are  simultaneously  sent  to  their  respective 
Issuer  subparts  in  the  remaining  phases  of  the  scan  cycle. 

Within each queue set, the treatment of microoperations is
somewhat different, depending on whether the ACTIVE stream or a
conditional stream is being issued. For the ACTIVE stream,
microoperations are transmitted, in order, from the DQA and then
the IQA. The microoperations transmitted from the DQA are those
that had been delayed in the previous scan cycle. Those being
delayed in the current cycle are not reanalyzed until the
following cycle. After the DQA has passed all of its eligible
microoperations to the Issuer, the IQA begins to transmit its
microoperations in the remaining phases. Any incoming
microoperations directed to this queue set while either IQA or
DQA are transmitting are placed in the IQA. They may be
transmitted  in  the  current  scan  cycle. 
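
The ACTIVE stream ordering described above can be sketched as follows. This is a simplified model under stated assumptions: the Issuer decision is taken to be immediate (so the DELAY BUFFER is omitted), one microoperation is transmitted per phase, and `can_issue` is a hypothetical stand-in for the Issue Policy.

```python
from collections import deque


class QueueSet:
    """Toy ACTIVE queue set with an INITIAL QUEUE and a DELAY QUEUE."""

    def __init__(self):
        self.iq = deque()  # incoming microoperations
        self.dq = deque()  # microoperations delayed in the previous cycle

    def scan_cycle(self, can_issue, phases):
        """Transmit up to `phases` microoperations; return those issued."""
        issued = []
        delayed = deque()   # rejected in this cycle, re-examined next cycle
        pending = self.dq   # delayed earlier: transmitted first, in order
        for _ in range(phases):
            if pending:
                op = pending.popleft()
            elif self.iq:
                op = self.iq.popleft()
            else:
                break
            if can_issue(op):
                issued.append(op)
            else:
                delayed.append(op)
        # Entries not reached this cycle stay behind the re-delayed older ones.
        delayed.extend(pending)
        self.dq = delayed
        return issued
```

A microoperation rejected in one cycle is therefore offered to the Issuer ahead of any newer IQ entry in the following cycle, while later microoperations may still be issued past it in the current cycle.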

In  a  queue  set  buffering  a  conditional  stream,  the 
microoperations it transmits to its Issuer subpart come from the
IQ. Microoperations that are delayed in this cycle, as before,
are passed to the DQ if they cannot be conditionally issued.
However, if at the beginning of the next cycle the predicate
value has not been determined, the DQ entries are cancelled and
conditional microoperations are retransmitted from the IQ in
their original order. Retransmission is necessary because the
issuing of delayed ACTIVE microoperations in this cycle may
establish a system state that will reject a conditional
microoperation that was issued in the previous cycle.

If  the  predicate  is  evaluated  and  the  contents  of  the 
conditional  queue  set  are  activated,  transmission  to  the  ACTIVE 
port  of  the  Issuer  commences  from  the  newly  activated  queue  set 
after  the  ACTIVE  queue  set  completes.  Any  microoperations 
delayed at this point are transferred to the DQ of the ACTIVE
queue set. Incoming ACTIVE microoperations are routed to
the ACTIVE IQ if it has been emptied of microoperations from the
predecessor stream; otherwise, to the IQ of the activated queue
set. If the predicate is evaluated and does not agree with the
value  assigned  to  a  conditional  queue  set,  then  the  queue  set 
contents  are  cancelled. 

This section discusses Window parameter values for the
organization described above. In particular, the necessary queue
capacities to meet performance goals and the Window transmission
rate t under the possible operating conditions are examined.

The  capacity  of  the  DB  has  been  discussed  in  the  previous 
section  and  is  n  cells.  This  is  the  required  number  of  scan 
cycle  phases  to  determine  the  disposition  of  a  microoperation 
transmitted to the Issuer. Because the DB is functionally an
appendage to the DQ, the combined capacity of the DQ-DB
combination is considered. If the Issuer scan rate is SR
microoperations per cycle, a minimum of SR cells is required by
the DQ-DB combination to prevent a buffer overrun. This is also
the maximum useful capacity of DQ-DB.

The  IQ  also  has  a  minimum  capacity  of  SR  cells.  This 
capacity is necessary to accommodate the maximum Issuer scan rate
when  the  DQ  is  empty  at  the  beginning  of  a  scan  cycle.  Maximum 
size  considerations  include  the  expected  block  size  and  the 
frequency  of  arm  promotion  where  the  promoted  arm  is  merged  with 
its  predecessor. 

The  window  transmission  rate,  t,  is  the  maximum  number  of 
microoperations  that  can  be  transmitted  to  an  Issuer  input  port 
in one control cycle. To describe the possible queue states at
the beginning of a scan cycle that can affect t, these variables
are used:

SR:   the number of phases in a scan cycle

t:    the maximum transmission rate to the Issuer

p:    the number of phases in a scan cycle

n:    the number of phases required to determine a
      microoperation issue disposition

q,d:  the number of microoperations initially in the IQ
      and DQ respectively ($0 \le q \le p$, $0 \le d \le p-n$).

b:    the known number of microoperations in DB at the
      beginning of a scan cycle, $0 \le b \le n$.
      $b = \phi$ is used to indicate that DB is known to be
      initially empty.

m:    Window input rate of microoperation stream in
      microoperations/cycle.

There are two classes of queue states, Initial and Dynamic.
The Initial state occurs when all queues in the queue set are
empty. The state is described by ($q=0$, $d=0$, $b=\phi$). Otherwise,
Dynamic states with ($x \le q \le p$, $y \le d \le p-n$, $z \le b \le n$),
$x+y+z \ge 1$ occur. For the Initial state

$$0 \le t \le \min\{m,\ SR-1\}$$

The limit SR-1 occurs because microoperations must first pass
through the IQ before they are transmitted to the Issuer and a
one scan cycle phase delay is incurred. For a Dynamic state,
if $n > b+d$ then $n-(b+d)$ of the initial phases cannot be used for
transmission. These idle phases are required to establish the
potential existence of a delayed microoperation examined by the
Issuer in the last n phases of the previous cycle. Thus

$$t = \begin{cases} q+d+m & \text{for } 1 \le q+d+m \le SR-n \\ SR-n+\min\{n,\ b+d\} & \text{for } SR-n < q+d+m \end{cases}$$

The  above  relation  demonstrates  the  importance  of  minimizing  the 
Issuer  response.  In  the  case  where  no  microoperations  are 
delayed, b=d=0 and t=SR-n, i.e., the effective scan rate is
reduced from the maximum scan rate by n.
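
As a worked illustration, the two bounds can be evaluated numerically. The sketch below takes the Dynamic-state relation as t = q+d+m when q+d+m does not exceed SR-n, and t = SR-n+min(n, b+d) otherwise; the function and argument names are ours.

```python
def initial_state_bound(m: int, sr: int) -> int:
    # Initial state: one phase is lost while microoperations pass through the IQ.
    return min(m, sr - 1)


def dynamic_transmission_rate(q: int, d: int, b: int, m: int, sr: int, n: int) -> int:
    # q, d: microoperations initially in the IQ and DQ; b: known DB contents;
    # m: Window input rate; sr: scan rate; n: phases needed for an issue decision.
    demand = q + d + m
    if demand <= sr - n:
        return demand  # supply-limited: everything available is transmitted
    # Scan-limited: up to n early phases are usable only if delayed work exists.
    return sr - n + min(n, b + d)
```

With SR = 8 and n = 2, an ample supply and no delayed microoperations (b = d = 0) gives t = 6, the reduction by n noted above; with b + d = 3 the full rate of 8 is recovered.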

3.3.5  Microoperation Cache

Cache  is  the  primary  source  of  microoperations  and  is  backed 
up  by  Control  Memory.  It  is  loaded  with  control  words  before  a 
new  microprogram  is  to  begin  executing  and  transfers  control 
words to the Buffer Decoder as directed by Fetch Control. Its
transmission rate to the Buffer Decoder must be sufficiently high
to  provide  microoperations  to  the  conditional  streams  supported 
by  the  Converger. 

The two transmission operations of the Cache are the Read,
which delivers a word to the Buffer Decoder, and the Write, which
accepts a word from Control Memory. Table 3.2 shows a
segmentation  of  the  activities  for  a  Read  and  Write.  Figure  3.7 
illustrates  a  Cache  organization  that  could  be  pipelined  in  3 
sections  to  provide  multiple  accesses  per  execution  cycle. 

    READ                          WRITE

1.  Transmit word address         1.  Transmit word address
2.  Search ADDRESS MEMORY         2.  Generate DATA MEMORY address
3.  Access DATA MEMORY            3.  Write word in DATA MEMORY

Table 3.2  Segmentation of Cache Activities

The pipeline sections follow the segmentation outlined in
Table 3.2. For the Read, Fetch Control transmits an address that
is latched into the READ ADD register. This initiates a search
in ADDRESS MEMORY, an associative memory that maps Control Memory
addresses into DATA MEMORY addresses and determines if the
requested control word resides in Cache. The resulting DATA
MEMORY address, if found, is latched into the CacheADD register
and a DATA MEMORY access is initiated. The fetched control word
is gated into the READ DATA register. From there, it is passed
to the Buffer Decoder.

If  the  requested  control  word  is  not  resident  in  Cache,  its 
address  is  transmitted  to  the  CMAR  and  the  WRITE  ADD  registers. 
After  the  word  has  been  fetched  in  the  following  cycle  it  is 
transferred  to  the  WRITE  DATA  register  and  entered  into  DATA 
MEMORY. The control word is also transmitted to the Buffer
Decoder. Note that during the Control Memory access the Reading
of Cache can continue. By replacing the CMAR and WRITE ADD
registers with FIFO queues, several Writes can be pending without
undue obstruction of Reads.

The  required  speed  of  the  components  of  Cache  will  depend  on 
the  transmission  requirements  to  the  Buffer  Decoder.  For 
example,  four  Reads  and  one  Write  could  supply  four  conditional 
streams  to  the  Converger  while  permitting  control  words  of  the 
next  microprogram  to  be  prefetched.  Using  our  timing  basis 
described  in  section  3.3,  the  pipeline  section  delay  maximum  must 
be less than 6 gate delays. This requirement should be
attainable  with  ECL  logic. 

The  effectiveness  of  Cache  will  depend  to  a  large  extent  on 
the  locality  [MAT75]  of  the  microprogram  and  on  the  capacity  of 
Cache.  These  parameters  directly  affect  the  Cache  Hit-Ratio 
which  is  defined  as  the  ratio  of  the  number  of  successful  Cache 
accesses to the total number of Control Memory accesses. If the
locality changes slowly, the Hit-Ratio will be high and the
microoperation supply to the Converger can be adequately
maintained.  If  locality  changes  rapidly  and  there  are  many 
conditional  branches,  the  Converger  supply  may  be  inadequate  to 
cope  with  the  interruptions.  This  would  cause  many  accesses  to 
Control  Memory  after  a  predicate  resolution,  subjecting  the 
microoperation  stream  to  the  delays  associated  with  the  passage 
through  the  adaptive  processor  blocks.  Such  delays  would  not  be 
overlapped  with  execution  and,  consequently  would  contribute  to 
the execution time of the microprogram. In these cases of rapid
locality change and high conditional branch incidence, only a
reduction in Control Memory access time would alleviate this
situation.
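
The Read path and the Hit-Ratio can be sketched together in a small functional model. Several details are assumptions not fixed by the text: the class name `MicroCache`, the round-robin replacement of DATA MEMORY slots, and the reading of the Hit-Ratio as hits divided by all Read requests presented to the Cache. Pipelining and timing are ignored.

```python
class MicroCache:
    """Functional model of the Cache: ADDRESS MEMORY lookup plus DATA MEMORY."""

    def __init__(self, control_memory, capacity):
        self.control_memory = control_memory    # Control Memory address -> control word
        self.capacity = capacity
        self.address_memory = {}                # Control Memory address -> DATA MEMORY slot
        self.data_memory = [None] * capacity
        self.next_slot = 0                      # round-robin replacement (assumption)
        self.hits = 0
        self.accesses = 0

    def read(self, address):
        self.accesses += 1
        slot = self.address_memory.get(address)   # search ADDRESS MEMORY
        if slot is not None:
            self.hits += 1
            return self.data_memory[slot]         # access DATA MEMORY
        word = self.control_memory[address]       # miss: fetch from Control Memory
        self._write(address, word)
        return word                               # word also goes to the Buffer Decoder

    def _write(self, address, word):
        slot = self.next_slot
        # Drop any address currently mapped to the slot being overwritten.
        for addr, used in list(self.address_memory.items()):
            if used == slot:
                del self.address_memory[addr]
        self.address_memory[address] = slot
        self.data_memory[slot] = word
        self.next_slot = (slot + 1) % self.capacity

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

Replaying an address trace through `read` and inspecting `hit_ratio()` for several capacities is one simple way to explore the dependence on locality and capacity discussed above.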


Experiments on memory hierarchies show that the Hit-Ratio of
the fastest memory tends to increase as its capacity is
increased. We would expect similar improvements with an increase
to Cache capacity. This also would depend on the locality of
microprograms. Simulation experiments can be used to determine
'good' sizes for Cache capacity and block size for various
microprograms.

The  main  reason  for  the  Cache  is  to  provide  multiple  prefetch 
streams to the Converger arms each cycle. An alternate
organization might be to provide a separate Control Memory with
identical  contents  for  each  prefetch  stream  required.  Again,  LSI 
and  VLSI  make  such  a  scheme  viable. 

3.3.6  The  Buffer  Decoder 

The  Buffer  Decoder  has  two  functions.  The  first  function  is 
to  accept  a  control  word  and  to  serialize  the  contained 
microoperations  into  a  stream  that  enters  one  of  the  Converger 
arms.  We  assume  that  the  microoperations  are  correctly  ordered 
within  the  control  word.  The  second  function  is  to  decode  the 
microoperations,  detecting  branches  and  conveying  the  necessary 
information  to  Fetch  Control.  In  addition,  the  second  function 
could  include  reformatting  the  microoperations  into  a  standard 
format  making  some  of  the  microoperation  information  explicit. 
This will facilitate the functions of the Converger and other
system  blocks  further  downstream. 

The  Buffer  Decoder  consists  of  several  independent  sections, 
each section being capable of transmitting microoperations to one
of the Converger arms. Each section has a parallel-in, serial-out
buffer for microoperations and a decoding network that
detects stream control microoperations. Fetch Control transfers
a control word to a section and its microoperations are passed in
sequence to the specified arm in the Converger. Whenever a
stream control microoperation is detected by the decoder,
microoperation transmission to the Converger arm ceases and
stream control information is passed to Fetch Control. The
stream control microoperation terminates the block being
transmitted and the section waits for a new directive from Fetch
Control.  Transmission  can  also  be  terminated  if  the  Converger 
arm  is  deactivated  by  a  predicate  evaluation. 
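
The behaviour of one Buffer Decoder section can be sketched as below. The opcode names in `STREAM_CONTROL`, the tuple representation of a microoperation, and the `arm_active` callback are hypothetical and chosen only for illustration.

```python
# Hypothetical stream control opcodes; the real set depends on the microcode format.
STREAM_CONTROL = {"BRANCH", "CBRANCH", "END"}


def decode_section(control_word, arm_active=lambda: True):
    """Serialize a control word toward one Converger arm.

    Returns (sent, stream_control): the microoperations passed to the arm, and
    the stream control microoperation handed to Fetch Control, if one was found.
    """
    sent = []
    for op in control_word:
        if not arm_active():
            return sent, None      # arm deactivated by a predicate evaluation
        if op[0] in STREAM_CONTROL:
            return sent, op        # terminates the block; wait for Fetch Control
        sent.append(op)
    return sent, None
```

Microoperations that follow the stream control microoperation in the same control word are not transmitted, reflecting that the block ends at that point.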

The number of sections in the Buffer Decoder should be
matched to the number of conditional streams being accepted by
the  Converger,  and  to  the  microoperation  flow  rate  from  Cache  and 
Control  Memory.  Considerations  in  determining  the  number  of 
conditional  streams  in  the  Converger  and  the  flow  rate  of  Cache 
have  been  examined  in  previous  sections.  Each  control  word  from 
Cache  is  placed  on  the  MEMORY  OUT  bus  and  its  destination  to  a 
Buffer  Decoder  section  is  determined  by  Fetch  Control.  The 
serial output stream of the section is directed to the
appropriate stream input of the Converger.

The transmission function of the Buffer Decoder could be
based on a parallel-in, serial-out shift register with a high
enough  shift  rate.  Another  possibility  could  be  an  N-field 
parallel  input  register  whose  fields  could  be  sequentially 
multiplexed  onto  a  Converger  input  bus.  The  decode  function, 
because  there  are  only  a  limited  number  of  stream  control 
microoperations,  could  be  implemented  using  a  brute  force,  two 
level  logic  network. 


3.3.7  Control

The description of Control functions has been distributed
throughout the sections describing the functioning of the Stream
Controller. They will be summarized here. As noted before, the
functions are affected by the available hardware used to
implement the Stream Controller. The functions have been broadly
classified into fetch control and predicate evaluation
activities.


Fetch  Control  performs  the  following  activities: 


1.  It  determines  the  addresses  of  control  words  that  are  routed 
to  the  sections  of  the  Buffer  Decoder.  It  must  have  a 
program  counter  for  each  possible  Converger  arm  that  is  being 
filled  with  microoperations.  This  capability  must  include 
the  generation  of  the  initial  address  of  a  microprogram  and 
the  target  address  of  a  branch. 

2.  For  each  conditional  arm  of  the  Converger,  it  maintains  a 
specification  of  which  predicate  determines  its  promotion. 
The  nature  of  the  specification  will  depend  on  the 
Conditioning  scheme  used.  For  example,  in  Nested 
Conditioning,  the  condition  code  register  is  assigned  by 
Fetch Control. Upon notification from the Predicate
Evaluator of a predicate value, Fetch Control performs the
PROMOTE operation. It also executes FORK and CREATE
operations  on  the  Converger  upon  detection  of  a  conditional 
branch. 
The Predicate Evaluator has the following duties:


1.  It  provides  storage  for  condition  codes  and  their 
corresponding  predicate  specifications. 

2.  Upon a condition code generation, it evaluates the predicate,
if its specification is available, and notifies Fetch Control
that a promotion may be possible. The Control Buffer is also
notified to select the microinstruction for initiation.

Appendix A contains a description of the activities used for
an implementation of a Nested Conditioning scheme.


3.3.8  The Issuer and Control Buffer

The  Issuer  and  Control  Buffer  together  determine  the 
disposition  of  a  microoperation  once  it  emerges  from  the 
Converger.  They  undertake  many  of  the  duties  performed  by  the 
microprogrammer for traditional microprogrammable central
processing units. The Issuer examines a microoperation against
an  Issue  Policy  and  if  the  conditions  set  out  are  satisfied,  the 
microoperation  is  issued  to  the  Control  Buffers.  If  the 
conditions  are  not  satisfied,  the  microoperation  is  returned  to  a 
delay  queue  in  the  Converger  to  be  examined  again  in  the  next 
cycle. The Control Buffer follows through with the disposition
phase and determines, according to its Execution Policy, when and
if the microoperation is to initiate execution. These policies
are  dependent  on  the  hardware  structures  used  to  implement  the 
microoperation  interpretation  process  and  consequently,  on  the 
design  philosophy  used  to  synthesize  the  central  processing  unit. 
As  will  be  seen,  hardware  can  allow  considerable  latitude  in 
apportioning  the  disposition  activities  between  the  Issuer  and 
Control  Buffer.  Indeed  some  of  the  control  duties  could  be 
undertaken by the function units in the operation unit, further
enlarging the spectrum of control distribution. We will
concentrate on describing how the Issuer and Control Buffer
combine to ensure that the microoperation stream is correctly and
efficiently executed.

The  hardware  structures  that  will  be  examined  in  this  context 
are  the  standard  microinstruction  register,  buffers  for 
microoperation  sources,  buffers  for  microoperation  destinations, 
dynamic  work  register  allocation  schemes  and  virtual  function 
units.  These  will  be  described  separately  in  later  sections 
after  the  following  section  on  hazards. 

3.4  Execution Hazards

The input stream is assumed to be in the form of vertical
microcode. The Issue and Execution Policies
must  ensure  that  the  resulting  execution  stream  is  equivalent  to 
the  sequential  execution  of  the  input  stream  for  all  possible 
input  streams.  In  addition  to  satisfying  the  data  constraints 
for  equivalence,  the  policies  must  also  ensure  that  the  resource 
requirements  of  the  microoperation  are  met  and  that  only  active 
microoperations  are  executed.  First  the  effects  of  data 
constraints  will  be  discussed.  A  few  definitions  will  aid  the 
description. 

If $I_i$ and $I_j$ are two microoperations, $I_i$ precedes $I_j$ if $I_i$
occurs before $I_j$ in the input stream. The sets of sources and
destinations of a microoperation $I$ are denoted by $S(I)$ and
$D(I)$ respectively. The definitions $S(P)$ and $D(P)$ are
straightforward extensions when $P$ is some set of microoperations.
$I_j$ is an immediate successor of $I_i$ if $I_i$ precedes $I_j$, and there
exists a $d \in D(I_i) \cap S(I_j)$ such that there is no $I_k$ such that
$I_i$ precedes $I_k$, $I_k$ precedes $I_j$, and $d \in D(I_k)$.

$I_i$ is locally independent of $I_j$ if

(i)    $D(I_i) \cap S(I_j) = \phi$

(ii)   $S(I_i) \cap D(I_j) = \phi$

(iii)  $D(I_i) \cap D(I_j) = \phi$

Locally independent microoperations use disjoint sets of work
registers. $I_i$ and $I_j$ are said to be globally independent if
$I_i$ and $I_j$ are locally independent and if $I_i$ precedes $I_j$, $I_j$ is
locally independent of all $I_k$ that precede $I_j$ but not $I_i$. Global
independence has been studied in the literature
[EER65, ALL69, TF73, DT76]. The conditions of global independence
are sufficient for allowing $I_i$ and $I_j$ to be executed in an
arbitrary order without affecting execution equivalence. The
conditions of global independence, however, can be considerably
relaxed to permit concurrent execution of microoperations
[KEL75, DT76].
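
The independence conditions translate directly into set operations. The sketch below is an illustration only: the `MicroOp` record and function names are ours, and global independence is read as local independence of the pair together with local independence of the later microoperation from every microoperation lying between the two in the input stream.

```python
from typing import NamedTuple, Sequence


class MicroOp(NamedTuple):
    name: str
    sources: frozenset   # S(I): work registers read
    dests: frozenset     # D(I): work registers written


def locally_independent(a: MicroOp, b: MicroOp) -> bool:
    # Conditions (i)-(iii): the three intersections must all be empty.
    return (not (a.dests & b.sources)
            and not (a.sources & b.dests)
            and not (a.dests & b.dests))


def globally_independent(stream: Sequence[MicroOp], i: int, j: int) -> bool:
    # stream is in input-stream order and i < j.
    later = stream[j]
    if not locally_independent(stream[i], later):
        return False
    return all(locally_independent(later, k) for k in stream[i + 1:j])
```

A pair can be locally independent yet fail the global test when the later microoperation depends on an intervening one, which is why the intermediate microoperations must be examined.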

We  require  the  set  of  data  conditions  that  must  be  satisfied 
to  permit  the  safe  issue  of  a  microoperation  from  the  Converger 
to  the  Control  Buffer  for  execution  initiation.  These  are 
obtained  by  reworking  the  conditions  of  independence  and  it  is 
convenient  to  introduce  the  notion  of  a  hazard.  Briefly,  a 
hazard occurs when microoperation executions are not synchronized
and wrong values are referenced. They can occur either during
the reading of sources or during the writing to the destinations
as described below.
A microoperation I is said to read from its i'th source, s,
when a copy of its value is dedicated to the i'th input of the
function unit assigned to execute I. For example, in the case of
a conventional operation unit which has no buffering of function
unit inputs, the reading of s occurs when the contents of the
work register specified by s are gated to the i'th input of the
function unit. In this instance all of the sources are
simultaneously read. With more sophisticated hardware, such as
virtual function units, the i'th source can be read
independently of the other sources because the virtual function
unit provides operand buffers. In this case, a copy of the
source value is dedicated to the microoperation execution when it
is placed in the operand buffer.

A microoperation I is said to write a result value to its
i'th destination, d, when a copy has been entered into d or when
a copy of the value has been dedicated to all sources s = d of
the immediate successors of I. This dedication of result copies
includes possible operand forwarding capabilities of source
buffers.

There are two types of data hazards in conventional
microoperation execution - the early-read hazard and the
early-write hazard. The execution of a microoperation $I_j$ causes an
early-read hazard with $I_i$ if

(i)   $I_i$ precedes $I_j$ and

(ii)  $I_j$ reads the value from $s \in D(I_i) \cap S(I_j)$ before $I_i$
has written a result value to $s$.
An early-read hazard occurs when a microoperation prematurely
reads one of its sources. In an analogous manner, the execution
of microoperation $I_j$ causes an early-write hazard with $I_i$ if

Type 1:

(i)   $I_i$ precedes $I_j$,

(ii)  There exists a $d \in D(I_i) \cap D(I_j)$ and
$I_j$ writes into $d$ before $I_i$ writes into $d$.

Type 2:

(i)   $I_i$ precedes $I_k$,

(ii)  $I_k$ precedes $I_j$, and

(iii) There exists a $d \in D(I_i) \cap S(I_k) \cap D(I_j)$ and $I_j$ writes
into $d$ before $I_k$ reads $d$.

In case 1, $I_i$ completes execution after $I_j$, overwriting the
intended value of $d$. Thus all immediate successors of $I_j$ that
reference $d$ and are initiated after $I_i$ completes will use the
wrong source value. In case 2, $I_j$ prematurely overwrites the
value created by $I_i$, causing erroneous reads by the immediate
successors of $I_i$ that access this value after $I_j$ completes.
Note that if there are no redundant microoperation executions,
case 1 is a subset of case 2. In a correct program, an
irredundant microoperation $I_i$ would have an immediate successor
$I_k$ that would precede $I_j$. Case 1 is included because Register
Schemes may generate redundant executions of condition code
microoperations. These hazards are avoided on conventional
microprogrammable central processing units by properly assigning
the microoperations in microinstructions during microcode
generation and by status flags that inhibit microinstruction
cycling until the data is generated.
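
The hazard conditions can be checked mechanically on a timed execution record. The sketch below is a simplification under stated assumptions: each microoperation reads all of its sources at one time and writes all of its destinations at one time (the conventional, unbuffered case), times are integer cycle numbers, "before" is taken as strictly earlier, and the `Exec` record and function names are ours. In every function the arguments are given in input-stream order.

```python
from typing import NamedTuple


class Exec(NamedTuple):
    sources: frozenset   # S(I)
    dests: frozenset     # D(I)
    read_time: int       # cycle in which the sources are read
    write_time: int      # cycle in which the destinations are written


def early_read(first: Exec, second: Exec) -> bool:
    # second reads a register that first writes, before first has written it.
    return bool(first.dests & second.sources) and second.read_time < first.write_time


def early_write_type1(first: Exec, second: Exec) -> bool:
    # second writes a shared destination before first writes it.
    return bool(first.dests & second.dests) and second.write_time < first.write_time


def early_write_type2(first: Exec, middle: Exec, second: Exec) -> bool:
    # second overwrites a value produced by first before middle has read it.
    shared = first.dests & middle.sources & second.dests
    return bool(shared) and second.write_time < middle.read_time
```

An Issue Policy that is hazard-free in this sense never admits a schedule for which any of the three predicates holds.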

A discussion of such hazards in pipeline architectures is
given by Ramamoorthy and Li [RL77]. They describe the hazard
potential in the accessing of instructions, memory operands and
register operands as well as describing hardware techniques that
can be used to avoid them. As we assume that the input
microprogram is not modified during its execution, we will be
concerned only with the access of memory and register operands.
We will also assume that main memory has a hardware manager that
properly queues requests. Thus we will concentrate mainly on
hazard prevention to register operands.

In  addition  to  the  data  hazards  described  above,  adaptive 
processor organizations must guard against stream hazards. A
stream hazard occurs when a conditional, non-stream
microoperation from a conditional program arm writes to its
destination before the execution condition of the microoperation
becomes ACTIVE. Stream microoperations, however, are executed in
the  Stream  Controller  as  soon  after  detection  as  possible.  This 
may  occur  before  the  block  that  contains  the  branch  is  activated. 
In  this  thesis  we  discuss  conditional  executions  only  briefly. 
We  issue  at  most  only  three  microoperation  streams  from  the 
currently  ACTIVE,  TRUE  or  FALSE  program  arms.  Thus  for  non- 
stream  microoperations  it  is  sufficient  that  the  adaptive 
processor  initiate  only  the  ACTIVE  microoperations  for  stream 
hazard  avoidance. 

The  data  and  stream  hazards  described  above  are  the  only 
possibilities  of  erroneous  access  to  sources  and  destinations 
that we will consider. If in the execution of a microoperation
no such hazard occurs, we say the microoperation execution is
hazard-free or safe. This notion is directly extended to define
the hazard-free executions of blocks and microprograms.
3.5  Machine Operating State

In addition to guaranteeing hazard-free microoperation
initiations, resources must be allocated by the Issuer and
Control Buffer. Information for all these activities is provided
by the operating state of the machine. The operating state
communicates  such  information  as  the  executing  state  of  function 
units,  the  status  of  values  in  work  registers,  and  the  status  of 
microoperations  examined  previously  by  the  Issuer. 

The executing state of the function units conveys the
remaining busy time of each executing function unit. The executing
state is represented by an N-dimensional vector

    $Y = (y_1, y_2, \ldots, y_i, \ldots, y_N)$

where  N      =  the total number of function units,

       $y_i$  =  the remaining busy time, measured in control cycles,
                 of the i'th function unit.

The i'th component $y_i$ communicates to the initiation controls, in
the present cycle, the number of cycles the i'th function unit
will be busy after the beginning of the next cycle. As the
execution times are determinate, this information can be used to
predict when a function unit will complete its current execution.
This will allow the function unit to be assigned in the cycle
before its execution is complete.
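
The bookkeeping implied by the vector Y can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ExecutingState` and the helpers `assignable_now`, `assign`, and `advance_cycle` are hypothetical and do not appear in the text, and execution times are taken to be whole control cycles.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExecutingState:
    """The vector Y: y[i] is the number of control cycles function unit i
    stays busy after the beginning of the next cycle."""
    y: List[int]

    def assignable_now(self) -> List[int]:
        """Units that are free at the start of the next cycle, so a
        microoperation can be assigned to them in the current cycle."""
        return [i for i, busy in enumerate(self.y) if busy == 0]

    def assign(self, unit: int, execution_time: int) -> None:
        """Assign a microoperation (hypothetical helper). It initiates at
        the start of the next cycle and occupies the unit for
        execution_time cycles."""
        if self.y[unit] != 0:
            raise ValueError("function unit is still busy")
        self.y[unit] = execution_time

    def advance_cycle(self) -> None:
        """Move to the next control cycle: each busy unit counts down."""
        self.y = [max(0, busy - 1) for busy in self.y]
```

A unit whose component has reached zero is still executing its last cycle, yet it can already be assigned, which is the one-cycle-ahead prediction described above.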

Work register status conveys two related items of
information. The register status informs the system whether the
register is available as a source and whether it can accept the
result of a microoperation. Because the execution time of a
microoperation once it has been initiated is determinate, source
availability can be predicted. This allows source values to be
accessed immediately after they are generated. Thus a
microoperation waiting for a source value can be released or
primed for execution in the cycle before the value becomes
available. Destination availability will be affected by the
hardware structure used, and it can also be predicted in the cycle
before a microoperation's initiation.

The microoperations previously examined by the Issuer can
have one of the following dispositions: executed, executing,
currently issued, or delayed. The latter three directly affect
the disposition of the microoperation currently being examined
for issue. To describe the dispositions of previously examined
microoperations, the notion of a microinstruction is extended to
define an executing microinstruction, current microinstructions,
and delayed microinstructions.

The  executing  microinstruction  is  basically  a  maximal 
microinstruction  that  specifies  information  the  Issuer  and 
Control  Buffer  need.  This  is  an  ordered  list  of  fields  for  each 
function  unit  in  the  operation  unit.  Each  field  specifies  the 
destinations  of  the  executing  microoperation,  the  remaining 
execution  time,  and  implicitly  by  its  position,  the  function  unit 
used.  The  executing  microinstruction  is  used  to  determine  which 
sources,  destinations  and  function  units  will  be  available  in  the 
next  control  cycle. 

There are three possible current microinstructions that
specify the microoperations that have been issued in the current
scan cycle and currently reside in the Control Buffer. The three
current microinstructions are called active, active-true and
active-false, and they contain the microoperations issued from the
ACTIVE, ACTIVE and TRUE, and ACTIVE and FALSE program arms
respectively. The current microinstructions specify the
microoperation fields giving information on the operation, the
required sources and destinations. The function unit to be used
is implicitly specified.

There are three delayed microinstructions, one for each of the
ACTIVE, TRUE, and FALSE program arms. They are essentially FIFO
queues containing microoperations that could not be issued in the
current cycle. The specifications of the delayed microoperations
can be used to convey to the Issuer which sources have remaining
references and which work registers are reserved as destinations.
The delayed microoperations are reexamined at the beginning of
the next scan cycle.

3.6  Issue and Execution Policies

Brief  descriptions  of  hardware  organizations  that  might  be 
used  for  adaptive  processor  implementation  are  given.  Each 
description  is  then  followed  by  sets  of  conditions  that  embody  an 
issue  policy  and  execution  policy  that  combine  to  ensure  the 
hazard-free  execution  of  microoperations.  The  particular 
hardware structures that we examine are the classical
microinstruction register, function unit input buffers,
microoperation buffers, work register input buffers or function
unit output buffers, dynamic register allocation, and virtual
function units with operand forwarding.
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The  conditions  that  must  be  checked  by  the  Issuer  and  Control 
Buffer are summarized in Table 3.3. The table indicates where
the conditions may be checked, whether in the Issuer or Control
Buffer. It also indicates which checks are absolutely necessary
for the hardware structures we will consider. Those that are not
absolutely necessary are obviated by operand buffering and
forwarding organizations.


                                    Issuer    Control    Necessary
                                              Buffer

Control Buffer Field Available?       X                      X

Source Value Available?               X          X           X

Destination has remaining
References?                           X          X

Destination not reserved for
preceding microoperation result?      X          X

Function Unit Available?                         X           X

Execution Condition ACTIVE?                      X           X


Table 3.3  Initiation Check List
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In  specifying  the  conditions  of  the  issue  and  execution 
policies,  we  use  the  following  notation: 


m             the microoperation currently being examined by the
              Issuer for issue.

E             the set of microoperations that will remain executing
              in the next execution cycle.

$I_x$   x=A:  the set of microoperations specified by the current
              Active microinstruction.
        x=T:  the set of microoperations specified by the current
              Active-True microinstruction.
        x=F:  the set of microoperations specified by the current
              Active-False microinstruction.

$Q_x$   x=A:  the set of microoperations specified by the delayed
              Active microinstruction.
        x=T:  the set of microoperations specified by the delayed
              Active-True microinstruction.
        x=F:  the set of microoperations specified by the delayed
              Active-False microinstruction.

$VFU_f = 1$   a virtual function unit of type f is available.
$F_f = 1$     a function unit of type f is available.
$B_f = 1$     a field of type f in the Control Buffer is available.

$P = \phi$    Active predicate not evaluated this cycle.
$P = T$       Active predicate evaluated with value = T.
$P = F$       Active predicate evaluated with value = F.

Phase = R     the scan phase coincides with the R phase of the
              execution cycle.
Phase = W     the scan phase coincides with the W (write) phase
              of the execution cycle.


3.6.1  The Microinstruction Register Control Buffer


This  organization  is  a  direct  descendant  of  the  traditional 
microprogram  central  processing  unit.  The  Control  Buffer  is 
essentially  a  modified  microinstruction  register  having  a 
microoperation  specification  for  each  function  unit  in  the 
operation  unit.  The  Control  Buffer  is  diagrammed  in  figure  3.8. 
The register is triplicated to accommodate a current
microinstruction for the Active, Active-True, and Active-False
microoperation streams. The microinstruction selected for
initiation is determined by the outcome of the predicate
evaluation for the Active arm at the end of the current scan
cycle. If no predicate evaluation occurs, the Active
microinstruction is initiated. If an Active predicate evaluation
occurs, one of the conditional microinstructions is initiated, as
selected by the predicate outcome.

Figure 3.8  The Microinstruction Register Control Buffer

Each  microoperation  specification  consists  of  a  set  of 
subfields  that  when  decoded,  activate  specific  control  points  in 
the  operation  unit.  There  may  be  some  variations  between 
microoperation  fields  for  different  class  microoperations,  but 
they will have the basic format shown. The S fields select the
sources to the function unit, which in this organization can be an
immediate constant, IMK, that comes from a buffer field (not
shown in diagram) or one of the work registers, Ri. OP specifies
the function to be executed by the function unit, D specifies
the destinations where the results are to be transferred, and the
condition code field specifies the condition code register in the
Predicate Evaluator where the generated condition code is to be
deposited. The field B specifies the remaining number of cycles
in which the microoperation field cannot be reassigned. The Bi
lines emanating from the B fields communicate function unit
busyness to the Issuer. They are zero-detect lines of counters
that decrement once each cycle from the initial B entry.

This  is  the  simplest  instance  of  a  Control  Buffer.  Its  only 
function  in  determining  the  hazard-free  execution  initiation  of 
microoperations  is  in  selecting  which  of  the  current 
microinstructions  is  to  be  initiated.  Consequently,  the  brunt  of 
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the  disposition  phase  is  carried  out  by  the  Issuer.  The  Issuer 
must  examine  the  machine  operating  state  with  each  microoperation 
examined  and  then  update  this  state  according  to  the 
specifications  of  the  microoperation  and  its  disposition.  The 
details  will  be  examined  below. 

For an f-type microoperation m with execution condition x, the
following issue policy must be satisfied for the issue of m:

ISSUE  INIT:  When (0)  Phase $\neq$ W,

              and  (1)  $F_f = 1 \wedge B_f = 1$,

              and  (2)  $S(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$,

              and  (3)  $D(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$,

              and  (4)  $D(m) \cap S(Q_A) = \phi$  if x = A,

                        $D(m) \cap S(\{Q_A \cup Q_x\}) = \phi$  if x = F or T.

       TERM:  When m has been input into the Issuer.

Condition (1) ensures that an f-function unit is available. It
is not sufficient that a control field is available. Waiting for
the f-function unit prevents early-write hazards. Otherwise, a
microoperation delayed in a control field might have its sources
overwritten by one of the initiated microoperations. Condition
(2) ensures that no early-read hazard will occur by requesting
that all sources be available in the next cycle when the
microinstruction is to be initiated. Similarly, condition (3)
requests that the specified destination be available in the next
cycle. Condition (3) along with (4) ensures that no early-write
hazard occurs. Condition (4) checks that there are no remaining
references to the destinations specified by m. Remaining
references can be specified only by delayed microoperations
because issued microoperations access their sources in the
initial phases of the next control cycle.


Result deposition occurs in phase W of the scan cycle and does
not overlap with source access. Condition (0) prevents issuing
when results are being deposited into work registers because
their contents will be in flux. Suitable hardware design of work
registers can eliminate this condition.

If  any  of  the  conditions  is  not  satisfied,  m  is  transmitted 
to  the  delay  queue,  Qx. 
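
The issue policy can be read as a predicate over sets of registers. The following Python sketch is one possible rendering, assuming that sources and destinations are sets of work register names and that unit and field availability are supplied as lookup tables; the names `Microop`, `S`, `D`, and `may_issue` are hypothetical, chosen to mirror the notation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Microop:
    name: str
    unit: str            # function unit type f
    cond: str            # execution condition x: "A", "T" or "F"
    sources: frozenset   # S(m): work registers read
    dests: frozenset     # D(m): work registers written


def S(ops):
    """Union of the source registers of a collection of microoperations."""
    return frozenset().union(*(op.sources for op in ops))


def D(ops):
    """Union of the destination registers of a collection of microoperations."""
    return frozenset().union(*(op.dests for op in ops))


def may_issue(m, phase, unit_free, field_free, E, I, Q):
    """Return True when conditions (0) to (4) of the issue policy hold.

    E is the executing set; I and Q map an arm ("A", "T", "F") to its
    current and delayed microoperations. A False result means m goes to
    the delay queue Q[m.cond].
    """
    x = m.cond
    pending = list(E) + list(I[x]) + list(Q[x])
    delayed = list(Q["A"]) if x == "A" else list(Q["A"]) + list(Q[x])
    return (
        phase != "W"                                    # (0)
        and unit_free[m.unit] and field_free[m.unit]    # (1)
        and not (m.sources & D(pending))                # (2) early-read
        and not (m.dests & D(pending))                  # (3) early-write
        and not (m.dests & S(delayed))                  # (4) remaining references
    )
```

Expressing the conditions as set intersections keeps the sketch close to the formulation used for performance modeling; an actual Issuer would more likely consult status bits, as discussed below.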

Issue  termination  may  appear  to  be  premature,  occurring 
immediately  after  Issuer  input.  This  termination  occurs  if  the 
Issuer  is  pipelined  and  can  begin  issuing  a  microoperation 
immediately  after  the  preceding  microoperation  has  been  accepted 
at  the  Issuer  input. 

Apart from control field availability, the above conditions
have been stated in terms of sources and destinations of the
preceding microoperations and the issuing microoperation. It
will be convenient to use this formulation in the next chapter
when performance is modeled. The Issuer, however, can obtain
equivalent conditions by referencing status bits of the work
registers and control fields in the Control Buffer. This will
depend on the actual implementation of the Issuer and work
registers. A centralized scheme similar to the scoreboard of the
CDC 6600, which centralizes the machine status, could be used.
Ramamoorthy and Li [PL77] describe an alternative scheme for
hazard avoidance. It detects conditions having a form similar to
the issue policy described above. The detectors are distributed
through the stages of the instruction execution pipeline.
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The  duty  of  the  Control  Buffer  is  to  initiate  the  Active 
microinstruction  at  the  beginning  of  the  next  cycle.  This  Active 
status is communicated at the end of a cycle by the Predicate
Evaluator.

The  execution  policy  for  an  issued  f-type  microoperation  m 
with  execution  condition  x  is: 

EXECUTE  INIT:  When (0)  Phase = R,

                and  (1)  $F_f = 1$,

                and  (2)  $P = \phi$  if x = A
                          $P = T$     if x = T
                          $P = F$     if x = F

         TERM:  When results are deposited on the function unit
                output bus.
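
The execution policy amounts to choosing one of the three current microinstructions from the predicate outcome. A minimal Python sketch follows, assuming the unevaluated predicate is represented by `None`; the function names `select_microinstruction` and `may_execute` are hypothetical.

```python
def select_microinstruction(P, current):
    """Pick the current microinstruction to initiate.

    P is None when no Active predicate was evaluated this cycle,
    otherwise "T" or "F". current maps "A", "T", "F" to the Active,
    Active-True and Active-False microinstructions.
    """
    arm = "A" if P is None else P
    return current[arm]


def may_execute(cond, phase, unit_free, P):
    """Conditions (0) to (2) of the execution policy for one microoperation
    with execution condition cond."""
    expected = {"A": None, "T": "T", "F": "F"}[cond]
    return phase == "R" and unit_free and P == expected
```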

3.6.2  The Source Buffered Control Buffer

This organization is very similar to the microinstruction
Control Buffer. The essential difference is that source fields
in the Control Buffer and the Window delay queues must
accommodate source values in addition to source specifications.
Thus when a microoperation is examined by the Issuer, the source
values, if they are available, are transmitted along with the
microoperation, either to the Control Buffer or delay queue.
This eliminates condition (4) of the Microinstruction Control
Buffer Issue Policy. The benefit of this organization is a
simpler Issue Policy that increases concurrency of execution
[KEL75]. The cost is the increased circuitry necessary to carry
around source values with microoperations.
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The  Issue  policy  for  the  Source  Buffered  Control  Buffer  is 
stated  below. 

For an f-microoperation m whose execution condition is x, the
following conditions specify the issue policy that must be
satisfied for m to be issued:

ISSUE  INIT:  When (0)  Phase $\neq$ W,

              and  (1)  $B_f = 1$,

              and  (2)  $S(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$,

              and  (3)  $D(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$.

       TERM:  When m has been accepted by the Issuer.

As with the Microinstruction Control Buffer, only function
unit availability and execution condition checking are required.
Thus, for an f-microoperation m with execution condition x, the
Execution Policy is specified as:

EXECUTE  INIT:  When (0)  Phase = R,

                and  (1)  $F_f = 1$,

                and  (2)  x = A  if $P = \phi$
                          x = F  if $P = F$
                          x = T  if $P = T$

         TERM:  When results are deposited in the specified
                destinations.

An example will illustrate the microoperation logistics using
source buffering. We will consider three microoperations m1, m2,
and m3 occurring in the input stream in the given order. They
are related through the relation

    $D(m_1) \cap S(m_2) \cap D(m_3) = r$

where r is a work register. We assume that m1 has been initiated
in some cycle i-k and that it will terminate at the end of cycle
i. Microoperations m2 and m3 are examined by the Issuer in cycle
i, m2 being delayed and m3 being issued. Figure 3.9 illustrates
the various events.


[Figure 3.9: timing diagram spanning cycles i-k, i, and i+1, each
divided into phases 1 to 8 with the R and W phases marked. Event
labels: m1 initiates; writes into r; transmitted to Issuer; r
value is buffered; m3 initiates; m1 terminates; value deposited
into r; is issued; is delayed.]

Figure 3.9  Source Buffering Events


In figure 3.9, portions of cycles i-k, i, and i+1 are shown.
Reading of source values by initiating microoperations occurs in
phase 1 of a cycle. The writing of destination values occurs in
phase 8. The Issuer examines microoperations in phases 1 to 7.
No microoperations are examined in phase 8 because the writing of
results occurs.


In cycle i, phase 4, m2 is examined by the Issuer and is
delayed. The delay is not due to the source value in r, which
will be available in cycle i+1. Microoperation m3 is examined in
phase 5 of cycle i and is issued. There is no danger of an
early-write hazard by initiating m3 in the following cycle. The
Issuer knows that the result register r will not be waiting in
cycle i+1 for a value generated by some microoperation preceding
m3. Furthermore, because source values are buffered when the
requesting microoperations are being examined by the Issuer,
there will be no outstanding reads when the result value of m3 is
deposited in r. This is illustrated by the actions on
microoperation m2 in cycles i and i+1 of figure 3.9. Without
source buffering, the issue of m3 would have been delayed.
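
The effect of source buffering on the example can be made concrete with a small Python sketch. It checks only the destination conditions applied to m3 and assumes every other issue condition holds; the function name `m3_may_issue` and its parameters are hypothetical.

```python
def m3_may_issue(pending_dests, delayed_sources, source_buffered, dest="r"):
    """Destination checks applied to m3, which writes register dest.

    pending_dests:   registers reserved for results of preceding
                     microoperations in the next cycle (condition (3)).
    delayed_sources: registers still referenced by delayed
                     microoperations (condition (4)).
    source_buffered: True for the Source Buffered Control Buffer, where
                     condition (4) is no longer checked.
    """
    if dest in pending_dests:
        return False
    if not source_buffered and dest in delayed_sources:
        return False
    return True


# Cycle i: m1 terminates at the end of the cycle, so r is not reserved
# in cycle i+1. The delayed microoperation m2 still reads r.
pending_dests = set()
delayed_sources = {"r"}

issued_with_buffering = m3_may_issue(pending_dests, delayed_sources, True)
issued_without_buffering = m3_may_issue(pending_dests, delayed_sources, False)
```

Under these assumptions m3 is issued only in the source buffered case, matching the closing remark of the example.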

3.6.3  Source and Result Buffering

This organization is an extension of the Source Buffered
Control Buffer. The additional hardware feature that we consider
is a result buffer at each function unit. Further extension,
such as multiple buffers, is also possible. The Control Buffer
has the responsibility to determine when the result is to be
deposited. In effect, the avoidance of class 2 early-write
hazards is deferred until the result has been generated. This
shift in responsibilities has the advantage of increasing the
concurrency of microoperation executions by reducing the
restrictions to execution initiation. Result buffering could
also be used in the avoidance of class 1 early-writes. However,
this technique would not be as effective as source buffering
because cases will arise where the issue of a microoperation must
be unduly delayed waiting for the result.

Result  buffering  decomposes  a  microoperation  execution  into  a 
generation  phase  and  a  deposition  phase  that  can  proceed 
independently  of  each  other. 

A relatively simple organization that exploits result
buffering will be described. Assume that each function unit can
buffer only one result. For simplicity, we restrict the issue of
a microoperation if its destination register has more than one
pending deposition. This restriction is not necessary for
hazard-free execution, but it results in simpler control. This
scheme requires two essential capabilities: detecting two pending
depositions and determining the sequence of the two depositions.

The method of detection will depend on the implementation.
For example, the Control Buffer can control the deposition,
retaining the destination specification of the control field
until the deposition is initiated. An Inhibit bit contained in
the destination specification can provide a solution to both
detection and sequencing.

To illustrate the use of the Inhibit bit, suppose that a
currently executing microoperation m1 specifying destination r
has been assigned to a control field f1 and that a microoperation
m2 which also specifies destination r is being issued to control
field f2. The Inhibit bit in field f1 is set after the issue of
m2. The Inhibit bit is used to prevent both the issue of another
microoperation specifying r as a destination and the deposition
of the result by m2 if m2 completes before m1. The Inhibit bit
is reset in phase W of the cycle before the result deposition by
m1 is to occur. This now allows the issue of a microoperation
specifying r as destination in the deposition cycle. Also, the
result deposition of m2 (if the result becomes available) can
proceed in phase W of the following cycle. Note that source
buffering prevents class 2 early-write hazards, allowing the
second deposition in the following cycle.

A set of Control subpolicies that provide sufficient
conditions for hazard avoidance is given below. The tasks used
in this portion of the microoperation interpretation job are:
Issue (or Delay) a microoperation to the Control Buffer, Generate
the microoperation result, and Deposit the result in the specified
destination. The following subpolicies are formulated for an
f-type microoperation, m, with execution condition x:


ISSUE  INIT:  When (0)  Phase $\neq$ W,

              and  (1)  $B_f = 1$,

              and  (2)  $S(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$,

              and  (3)  $D(m) \cap D(\{E' \cup I_x' \cup Q_x\}) = \phi$.

       TERM:  When next phase begins.

The set E' used in condition (3) is the set of executing
microoperations whose destination is specified as the destination
of some preceding microoperation that is currently executing.
These microoperations have set the Inhibit bit to prevent the
issue of any microoperation whose destination specifies this
register. Similarly, Ix' is the set of issued microoperations
whose destination is specified as the destination of a preceding,
issued or executing microoperation.


GENERATE  INIT:  When (0)  Phase = R,

                 and  (1)  $F_f = 1$,

                 and  (2)  $P = \phi$  if x = A
                           $P = F$     if x = F
                           $P = T$     if x = T.

          TERM:  When generated values are placed in result buffers.

DEPOSIT  INIT:  When  (0)  Phase  =  W, 

and  (1)  Result  value  is  in  output  buffer, 
and  (2)  Destination  Inhibit  bit  =  0. 

TERM:  When  write  phase  ends. 
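
The two capabilities required by this scheme, detecting two pending depositions and sequencing them, can be illustrated in Python. Note that this sketch swaps in a per-register FIFO of pending depositions as a simplified stand-in for the Inhibit bit; it is not the hardware mechanism described above. The class `DepositControl` and its methods are hypothetical.

```python
from collections import defaultdict, deque


class DepositControl:
    """Tracks pending depositions per destination register, in issue order."""

    def __init__(self):
        self.pending = defaultdict(deque)   # register -> microoperation names
        self.registers = {}                 # work register contents

    def may_issue(self, dest):
        """Issue is restricted once two depositions to dest are pending."""
        return len(self.pending[dest]) < 2

    def issue(self, name, dest):
        if not self.may_issue(dest):
            raise ValueError("two depositions already pending for " + dest)
        self.pending[dest].append(name)

    def may_deposit(self, name, dest, phase, result_ready):
        """DEPOSIT conditions: W phase, result in the output buffer, and no
        earlier deposition to dest still outstanding (the role played by
        the Inhibit bit in the text)."""
        queue = self.pending[dest]
        return (phase == "W" and result_ready
                and bool(queue) and queue[0] == name)

    def deposit(self, name, dest, value, phase="W", result_ready=True):
        if not self.may_deposit(name, dest, phase, result_ready):
            raise ValueError("deposition not permitted yet")
        self.pending[dest].popleft()
        self.registers[dest] = value
```

With m1 and m2 both targeting r, a result from m2 that completes first waits in its buffer until m1 has deposited, so r ends up holding the value of the later microoperation.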


3.6.4  Virtual Function Units and Result Forwarding


This scheme is an extension of source and destination
buffering. It encompasses the concept used in the 360/91
floating point unit [TOM67, KEL75] that has been described
earlier. The essential, additional feature is the automatic
forwarding of results to both the work registers and the source
buffers that request them. This extension defers the avoidance
of early-read and early-write hazards to the Control Buffer. A
consequence of this shift in responsibility is a reduction in the
complexity of the control subpolicies. The Issued and Delayed
microinstructions are no longer factors in the conditions for
hazard avoidance.

The  Control  Buffer  now  consists  of  sets  of  virtual  function 
units,  one  set  for  each  class  of  function  unit.  Each  virtual 
function  unit  can  buffer  one  microoperation  that  is  in  any  state 
of  execution  readiness.  When  it  becomes  execution  ready,  its 
execution  can  be  initiated  on  a  function  unit.  The  virtual 
function  unit  set  must  have  a  scheduling  scheme  to  resolve  the 
conflicts  when  there  are  more  execution  ready  microoperations 
than available function units for execution. Because execution
completions can be predicted, the scheduler can determine the
number of available function units and assign execution ready
microoperations on a first come, first served basis, or by a
circular polling of the virtual function units.

The routing of results can be accomplished by a scheme very
similar to the 360/91 common data bus [TOM67]. Each virtual
function unit is assigned a tag that identifies it as the
originator of a result value. The virtual function unit also has
source fields that specify the originator of source values and
buffers to store these source values. The work registers also
have a tag field that specifies the originator of a source value.
These tags are assigned by the Issuer when a microoperation is
assigned to a virtual function unit. When a microoperation
completes execution on a function unit, the tag field and result
are broadcast on the function unit output bus. All virtual
function unit source tags or work register tags that match the
emitted tag in-gate the accompanying result value.
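
The tag matching described above can be sketched as follows. This is an illustrative Python model only, assuming string tags and a single broadcast per call; the names `VirtualFunctionUnit` and `broadcast` are hypothetical.

```python
class VirtualFunctionUnit:
    def __init__(self, tag, source_tags):
        self.tag = tag
        # source slot -> tag of the originator still awaited
        # (None once the value has been buffered)
        self.source_tags = dict(source_tags)
        self.source_values = {}

    def ready(self):
        """Execution ready once every source value has been buffered."""
        return all(t is None for t in self.source_tags.values())


def broadcast(tag, value, vfus, register_tags, registers):
    """Place (tag, value) on the output bus. Every source field and work
    register whose tag matches in-gates the value."""
    for vfu in vfus:
        for slot, awaited in list(vfu.source_tags.items()):
            if awaited == tag:
                vfu.source_values[slot] = value
                vfu.source_tags[slot] = None
    for reg, awaited in list(register_tags.items()):
        if awaited == tag:
            registers[reg] = value
            register_tags[reg] = None
```

A work register whose tag was reassigned by a later issue simply no longer matches the earlier broadcast, so the stale result is not in-gated.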

The tasks used to describe this phase of microoperation
interpretation are Issue to Control Buffer, Generate the result,
and Deposit the result value. The following subpolicies are
formulated for an f-type microoperation m with execution
condition x:

ISSUE  INIT:  When (0)  Phase $\neq$ W,

              and  (1)  $VFU_f = 1$.

       TERM:  When next phase begins.

GENERATE  INIT:  When (0)  Phase = R,

                 and  (1)  $F_f = 1$,

                 and  (2)  $S(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$,

                 and  (3)  $P = \phi$  for x = A
                           $P = F$     for x = F
                           $P = T$     for x = T.

          TERM:  When result is on the function unit output bus.

DEPOSIT  INIT:  When (0)  Phase = W,

                and  (1)  Destination tag = broadcast tag.

         TERM:  When next phase begins.

3.6.5  Conditional Execution

Conditional  Execution  is  a  conceptually  simple  extension  of 
the  previous  scheme.  This  scheme  defers  stream  hazard  avoidance 
to  the  Deposit  task.  Conditional  microoperations  are  initiated 
when  the  data  dependencies  are  resolved.  However,  the  result  is 
not  transmitted  unless  a  predicate  evaluation  has  activated  the 
microoperation. The following subpolicies are directly derived
from the previous set. The subpolicies for an f-type
microoperation m with execution condition x are:

ISSUE  INIT:  When (0)  Phase $\neq$ W,

              and  (1)  $VFU_f = 1$.

       TERM:  When next phase begins.

GENERATE  INIT:  When (0)  Phase = R,

                 and  (1)  $F_f = 1$,

                 and  (2)  $S(m) \cap D(\{E \cup I_x \cup Q_x\}) = \phi$.

          TERM:  When result is in function unit output buffer.

DEPOSIT  INIT:  When (0)  Phase = W,

                and  (1)  $P = F$  for x = F
                          $P = T$  for x = T,

                and  (2)  Destination tag = broadcast tag.

         TERM:  When next phase begins.

The minimal additional hardware added to the previous scheme to
implement Conditional Execution is an output buffer register to
hold conditionally generated results along with the execution
condition until the predicate evaluation occurs. When the
predicate is evaluated, the selected results would then be
forwarded to their destinations. The benefit of this scheme is
the additional execution overlap during the predicate resolution
interval of a conditional branch.
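
The output buffer that holds conditionally generated results can be modeled in a few lines of Python. This is a sketch under the assumption that a single predicate outcome selects between the TRUE and FALSE arms; the class `ConditionalOutputBuffer` and its methods are hypothetical.

```python
class ConditionalOutputBuffer:
    """Holds conditionally generated results until the predicate resolves."""

    def __init__(self):
        self.held = []   # (execution condition, destination, value)

    def hold(self, cond, dest, value):
        """Buffer a result generated for execution condition "T" or "F"."""
        self.held.append((cond, dest, value))

    def resolve(self, outcome, registers):
        """On predicate evaluation, forward the results of the selected arm
        to their destinations and discard those of the other arm."""
        for cond, dest, value in self.held:
            if cond == outcome:
                registers[dest] = value
        self.held.clear()
```

No result reaches a work register before `resolve` is called, which is how the stream hazard is avoided at the Deposit task rather than at initiation.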

Clearly, the degree of conditional execution can be
considerably increased. This can be realized by extending
conditional executions to higher level arms in the Converger.
Some of the changes to the Control subpolicies and the hardware
are outlined below.

The execution condition for conditional microoperations will
now consist of several predicate values, xyz say. For GENERATE
initiation, condition (2), for a microoperation m whose execution
condition is xyz, would be changed to

    $S(m) \cap D(\{E \cup E_x \cup E_{xy} \cup E_{xyz} \cup I \cup I_x \cup \cdots \cup Q \cup Q_x \cup \cdots \cup Q_{xyz}\}) = \phi$

where $E_{xyz}$ is the conditional executing microinstruction whose
microoperations have execution condition xyz. The stream hazard
avoidance condition of the DEPOSIT task will also change. The
condition (1) becomes

    P = FFF  for xyz = FFF
        FFT  for xyz = FFT
        ...
        TTT  for xyz = TTT
To support the extra computations, additional virtual
function units and function units will be required. Copies of
the work register set will be needed to accommodate values for
conditional executions. The Converger operations of PROMOTE and
FORK will have to be extended to include the conditional work
register values. In addition, the identification tags must now
include execution condition information for proper result
routing.


3.6.6  Dynamic Register Allocation

We  conclude  the  section  on  the  Issuer  and  Control  Buffer  with 
a  brief  discussion  of  dynamic  register  allocation. 

Dynamic register allocation is, effectively, a 'data' cache
at the work register level. Dynamic register allocation
dynamically assigns a work register to the destination of a
microoperation. A register map is used to translate the original
specification to the register address of the physical register
assigned to the destination. All source specifications are also
translated to actual register addresses. In an adaptive processor,
these translations can be effectively overlapped with the
microoperation's traversal through the Stream Controller.
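
A register map of this kind can be sketched as a small lookup table. The Python below is illustrative only: the class `RegisterMap`, the free-list allocation order, and the omission of any policy for releasing physical registers are assumptions, not details given in the text.

```python
class RegisterMap:
    """Translates register specifications to physical register addresses."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))   # unassigned physical registers
        self.mapping = {}                       # specification -> physical address

    def translate_source(self, name):
        """Sources read the physical register most recently assigned."""
        return self.mapping[name]

    def allocate_destination(self, name):
        """Assign a fresh physical register to a destination specification.
        Releasing the previously assigned register is not modeled here."""
        if not self.free:
            raise RuntimeError("no free physical register")
        physical = self.free.pop(0)
        self.mapping[name] = physical
        return physical
```

Because each new destination receives its own physical register, a later write to the same specification cannot overwrite a value that earlier microoperations have yet to read, which is why early-write hazard avoidance becomes simpler.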

Dynamic register allocation can be used in conjunction with
the previously described Issuer and Control Buffer organizations.
Its use can simplify early-write hazard avoidance. The virtual
function unit with automatic result forwarding, however, provides
a superior solution to this problem. An area in which dynamic
register allocation can be usefully exploited is in providing
graceful register switchovers on transitions to new
microprograms and interrupts.

3.7  A  Mul^pll 

To demonstrate the operation of an adaptive microprogrammable
control unit, the state or actions of various adaptive
microprogrammable control unit blocks will be described as the
adaptive processor steps through a multiply computation. The
algorithm operates on sign and magnitude numbers fetched from
successive main memory locations, returning the most and least
significant product parts to the immediately following locations.
The algorithm description in an assembly-like language is given
in figure 3.10. The operation unit organization is shown in
figure 3.11.

The interpretation of most microoperations in figure 3.10
should be clear. The basic syntax of a microoperation is

<label>: <opcode> <sources>; <destinations>; <condition code>

Labels are optional and are the destinations of branches. The
opcode specifies the operation and the function unit type. In
figure 3.10 identical opcodes have appended integers. These are
used for identification in the following discussions. Sources
usually refer to work registers or to immediate constants, which
are preceded by a #. Destinations of non-branch microoperations
refer to work registers. Microoperations that affect condition
codes specify a condition code register in the second destination
field. This register is later specified by a conditional branch
that uses the value in its predicate evaluation. For branch
microoperations, the first source field specifies the condition
code register and the second field specifies (implicitly) the
expected condition codes for the predicate to be satisfied.
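
As an illustration of the syntax just described, the following minimal sketch splits one microoperation line into its fields. The function name and the dictionary keys are assumptions made for the example; no such tool is described in the text.

```python
def parse_microoperation(text):
    """Split 'label: opcode sources; destinations; condition code' into fields."""

    def split_list(field):
        # Comma-separated operands, with surrounding blanks removed.
        return [item.strip() for item in field.split(",") if item.strip()]

    head, *rest = [part.strip() for part in text.split(";")]
    label = None
    if ":" in head:
        # An optional label precedes the opcode and ends with a colon.
        label, head = [s.strip() for s in head.split(":", 1)]
    opcode, _, source_text = head.partition(" ")
    return {
        "label": label,
        "opcode": opcode,
        "sources": split_list(source_text),
        "destinations": split_list(rest[0]) if len(rest) > 0 else [],
        "condition_code": rest[1] if len(rest) > 1 and rest[1] else None,
    }


print(parse_microoperation("START: SB2 COUNT, #1; COUNT; CC2"))
```

For a branch microoperation the same split places the condition code register and the expected condition in the sources and the branch target in the destination field.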


        LD1  @BASE; MPIER               /* Load - fetch operands */
        AD1  BASE, #1; BASE             /* Add */
        LD2  MPIER; SIGN                /* Register transfer */
        LD3  @BASE; MCAND
        AD2  BASE, #2; BASE             /* Point to least sig.
                                           product part loc. */
        LD4  #0011; COUNT               /* Initialize iteration
                                           count for 4 bit MPIER */
        SB1  RESULT, RESULT; RESULT     /* Subtract - initialize most
                                           sig. prod. register */
        ⊕    SIGN, MCAND; SIGN          /* Exclusive OR - generate
                                           product sign bit */
        A1   SIGN, #1000; SIGN          /* AND */
        A2   MCAND, #0111; MCAND        /* Strip sign bit from
                                           multiplicand */
START:  SB2  COUNT, #1; COUNT; CC2      /* Set condition code CC2 */
        SR   RESULT, #1; RESULT, SO     /* Shift right product, save
                                           shift-out */
        SRI  MPIER, #1, SO; MPIER; CC1  /* Shift right MPIER with
                                           shift-in from SO. Set
                                           condition code CC1
                                           according to shift-out */
        B1   condition code1, SO=0; LOOP   /* Branch if shift-out = 0 */
        AD3  RESULT, MCAND; RESULT
LOOP:   B2   condition code2, > 0; START   /* Branch if COUNT > 0 */
        V    SIGN, RESULT; RESULT       /* OR sign bit to product */
        ST1  MPIER; @BASE               /* Store L.S. product part */
        SB3  BASE, #1; BASE             /* Point to M.S. prod. part */
        ST2  RESULT; @BASE
        END                             /* Microprogram completed */

Mnemonics

@BASE  : Pointer to data region in main memory. Initially points
         at multiplier.
BASE   : BASE contains a main memory address
MPIER  : Register containing multiplier and L.S. product part
MCAND  : Register containing the multiplicand
RESULT : Register containing M.S. product part
SIGN   : Register used to generate the sign bit
COUNT  : Loop count register
SO     : Shift out register

Condition Codes

SO=0 : the shift-out bit is '0'
>0   : the function unit output is greater than zero.

Figure 3.10  Multiply Algorithm

The information flow of the operation unit as diagrammed in
figure 3.11 shows an output bus for each function unit and work
register. Inputs to a particular device are selected by
activating one of its eligible inputs, a subset of the output
buses. Immediate data is supplied by buses originating in the
control register section. Arithmetic and logic operations are
executed by either one of the identical units FU1 or FU2.
Shifting is performed by either SHIFT1 or SHIFT2. In this
example, the shift or the function units provide the data path
for immediate data moves and transfers between registers. Main
memory requires an address input and a data input for a store
(ST) operation. For a load (LD) operation, main memory requires
an address input and provides data to the specified destination
register.

The hardware facilities for the shift right microoperations,
SR and SRI, illustrate how vertically compatible microoperations
can be implemented. In the code of figure 3.10, SR and SRI
operate on different registers but SRI shifts in the bit shifted
out by SR. These can be executed in the same cycle by activating
a path from the shift-out of one SHIFT unit to the shift-in of
the other. In a dynamic environment both microoperations must be
issued in the same cycle. This circumstance cannot always be
guaranteed because the successor microoperation might be waiting
for data or not even be detected when the first microoperation is
issued. One solution would be to have the adaptive processor
issue the first part of a vertically compatible microoperation
set but to delay execution initiation if the remaining part is
not issued. This would require information in the microoperation
format that indicates it is part of a vertically compatible set.
A second solution would be to specify a destination register,
either dedicated or one of the universally accessible work
registers, for the passed output. This register is then specified
as an input to the successor microoperation. To maximize
performance, register bypass logic should be provided to allow
issue and execution of vertically compatible microoperations in
the same cycle. In figure 3.11 a dedicated shift-out register,
SO, is provided for this purpose.

Figure 3.11  Operation Unit For Multiply Example

Microcode for a horizontal microprogrammable computer can be
generated by decomposing the code in figure 3.10 into blocks and
then examining the data flow within each block. Figure 3.12a
shows the decomposition of the algorithm into blocks,
illustrating the data flow within blocks. Each microoperation is
labelled by its opcode and the destination register that it
affects, if any. Figure 3.12b shows the control flow between
blocks. In 3.12a vertically compatible microoperations are
enclosed by dashed circles. A dashed edge between
microoperations indicates that the predecessor references a
register altered by the successor. Under certain circumstances,
such microoperations can be executed simultaneously. The
horizontal microcode for the multiply algorithm is shown below.

1.  LD1,AD1,LD4,SB1
2.  LD2,LD3,AD2
3.  ⊕,A2
4.  A1
5.  SR,SRI,B1
6.  AD3,SB2,B2
7.  SB2,B2
8.  V,ST1,SB3
9.  ST2

In figure 3.12a, forward inter-block edges between
microoperations have also been shown. These are useful in global
microprogram optimizations. Some are conditional, depending on
whether or not a block is entered. Backward inter-block edges,
not shown, are useful in optimizations between different
iterations of a loop.

Figure 3.12a  Multiply Algorithm Data Flow

Figure 3.12b  Multiply Algorithm Control Flow


The following is a list of the hardware specifications for
the adaptive processor:

Scan phases                             - 10/execution cycle
Microoperation execution times:
    Main Memory microops                - 5 execution cycles
    All others                          - 1 execution cycle
Control Memory access time              - 1 execution cycle
Cache latency (4 pipelined sections)    - 4 phases
Control Memory and Cache word size      - 8 microoperations
Window transmission rate                - 10 microoperations/cycle
Issuer transmission rate
    (3 pipelined sections)              - 10 microoperations/cycle
Converger levels                        - 3
Conditioning Scheme                     - Register, with promotion
                                          at all levels.


The multiply algorithm will step through a multiplication of
4-bit sign and magnitude numbers. In the example, the multiplier
used is X100 where X is the sign bit. The input microoperation
stream is given in figure 3.10.

Table 3.4 outlines the hardware actions and states of the
adaptive processor as the multiply algorithm is executed. Column
1 identifies the cycle number, column 2 displays the
microoperations initiated in the cycle and column three displays
the microoperations in the order that they resided in the window
for the cycle. In the Window contents, the F arm microoperations
are on the same line as the ACTIVE arm microoperations. The T arm
microoperations are on the line below. The last column displays
the possible microinstructions that have been composed from the
window contents. The initial segment, if any, lists ACTIVE only
microoperations, the upper tail and lower tail list FALSE only
and TRUE only microoperations, respectively. The fourth column,
Control, summarizes the stream control actions taken by Control.
B1A indicates that B1 has been detected in the microoperation
stream and that the target address is being generated. B1P
indicates that the B1 predicate will be evaluated at the end of
the cycle, allowing the activation of one of B1's successor
blocks. Column 5, labelled Control Memory, indicates the actions
taken to transfer microoperations to the Window. In the case
where a second branch is detected in a F or T arm of the window
while a first is pending, microoperations are transported to the
FF or TF arms respectively. After the branch target address is
computed, microoperations are fetched for the alternate arms - FT
or TT. A description of the significant events within the cycles
follows.

In this example, we do not include the effects of immediate
data on execution. In Control Memory immediate data follows its
referencing microoperation, occupying the succeeding
microoperation field. Thus a Control Memory word will not in
general contain its full allotment of 8 microoperations and thus
the effective microoperation transmission rate to the Converger
is reduced. The more important transmission rates of the Window
and Issuer, however, need not be degraded. This can be ensured
by increasing the Converger storage cell width to accommodate the
immediate data. Consequently, the microoperation stream will not
have gaps of immediate data when it is input into the Issuer.


Cycle numbering begins at -1. The first two cycles are used
to prime the adaptive microprogrammable control unit pipeline
sections, and these initial actions can be overlapped with the
termination of the previous microprogram. Actual execution of
the algorithm begins in cycle 1. A more detailed description
follows:


-1.  The  first  block  of  microoperations  is  fetched  and  placed  in 
cache. 

0.  The  first  block  is  passed  to  the  window  and  the  first 
microinstruction  is  composed.  The  second  block  of 
microoperations  is  fetched  and  placed  in  cache. 

1.  The first microinstruction is executed. Branches B1 and B2
are detected and their target addresses are computed. As the
targets reside in cache, they will be forwarded to the window
four phases after their respective branch addresses are
computed.

2.  SB2 and SR are initiated. The shift out of SR will be stored
in SO for later reference by SRI. B2 has been detected in
the T arm and its target address is calculated. As the
target microoperations reside in cache, they will be
transferred to the TT arm four phases after the address
computation of B2. Because SB2 is initiated, condition code2
will be available for B2's predicate, permitting the
promotion of either TT or TF to T and either FF or FT to F at
the end of the cycle.

3.  The F and T window arms are filled to their limit as
delineated by the second occurrence of B1. The filling of
the successor arms to F and T is begun. Two conditional
microinstructions are composed.

4.  Neither of the conditional microinstructions is initiated
because B1's predicate has not been resolved.

5.  LD1 will complete execution in this cycle and MPIER will be
available in the next cycle. This permits the issuance of
the indicated microoperations. SB2 cannot be conditionally
issued now because there is no available function unit for
it.

6.  In this cycle, condition code1 will be available, allowing
the activation of either the T or F arm. Two conditional
microinstructions have been issued, one of which will be
initiated in the next cycle.

7.  After the predicate of B1 was evaluated in the previous
cycle, the T window arm was activated, promoting its
successor arms TF and TT to F and T respectively. This
results in the window state shown. B2 is detected in F and T
and its target address is computed, fetching the FT and TT
arms from cache. As SB2 and SRI will both complete execution
in this cycle, both predicates can be evaluated. No
microoperations are issued. This cycle demonstrates how the
adaptive microprogrammable control unit can take advantage of
fortuitous data dependencies. A '0' was shifted out from
MPIER and the execution of AD3 was obviated, permitting the
adaptive microprogrammable control unit to enter the second
iteration of the loop. A conventional microprogrammable CPU
is effectively prevented from taking advantage of this
situation by microoperation A2 (figure 3.12a).
Microoperation A2 forces a wait for MCAND before the loop may
be entered.

8.  The resolution of B1 and B2 in the previous cycle provides
microoperations from the third loop iteration. Again B1 and
B2 are detected and the Control acts on them.

9.  Execution of the final iteration begins. Both branches will
be resolved at the end of the cycle.

10.  An END microoperation is detected by Control, allowing it to
commence the fetch of microoperations from the next
microprogram. MCAND will be available in the next cycle,
allowing the issue of some microoperations that reference it.
ST1 can be issued because there are no remaining references
to MPIER. SB3 is delayed because function unit1 and function
unit2 are assigned.

11-20.  The remaining microoperations are executed.
Microoperations from the next microprogram could be prepared
and possibly executed during these cycles.

For this example, we can compare the adaptive processor and
conventional microprogrammable computer performances and storage
requirements. Using the horizontal code previously listed, the
execution time can be calculated for four bit sign and magnitude
numbers using the execution time summary of table 3.5.

Microinstructions      No. of Execution Cycles

       1                          5
       2                          5
      3,4                         2
     5,6,7                       2xN
       8                          5
       9                          5

TABLE 3.5  Multiply Execution Summary for Horizontal Control Unit.

For the example, 28 cycles are required. This can be generalized
to 22+2N cycles where N is the number of bits in the magnitude.
Hand simulations were also repeated for all significant four bit
multipliers. For the eight possible positive multipliers, the
average execution interval is 23.2 cycles. Assuming all
multipliers are equally probable, the adaptive processor executes
faster, on the average, by 4.8 cycles. This corresponds to an
improvement over the conventional microprogrammable computer of
4.8/28 x 100% = 17.1%. In this example, relative improvement would
decrease with an increase in the number of magnitude bits because
the opportunity for combining microoperations between successive
loop iterations does not exist. In the next chapter, a more
general performance analysis is described.
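
The arithmetic behind these figures can be checked with a short sketch. The cycle counts are those of table 3.5 and the 23.2 cycle average is the hand-simulation result quoted above; the variable and function names are illustrative only.

```python
# Cycle counts per horizontal microinstruction group, from table 3.5.
FIXED_CYCLES = {"1": 5, "2": 5, "3,4": 2, "8": 5, "9": 5}
LOOP_CYCLES_PER_BIT = 2   # microinstructions 5, 6 and 7 take 2 cycles per bit


def horizontal_cycles(magnitude_bits):
    """Execution time of the horizontal microcode: 22 + 2N cycles."""
    return sum(FIXED_CYCLES.values()) + LOOP_CYCLES_PER_BIT * magnitude_bits


conventional = horizontal_cycles(3)   # 4-bit sign and magnitude, so N = 3
adaptive_average = 23.2               # hand-simulation average quoted in the text
saving = conventional - adaptive_average
print(conventional, round(saving, 1), round(100 * saving / conventional, 1))
```

The fixed part of the sum is 22 cycles, so only the 2N loop term grows with the number of magnitude bits.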

This example has demonstrated some of the adaptive processor
actions as it executes a microprogram. In particular, its
flexibility in selecting microoperations at run time allowed the
execution of some iterations of the multiply loop before the
multiplicand became available. This is not possible for a
conventional microprogrammable processor. Cycles 6-9 show the
need for the Converger to maintain 3 levels of the program
structure tree. The short length of the multiply loop and the B1
microoperation contained in it produced a rapid sequence of
promotions. These promotions cause a rapid passage of
microprogram blocks through the Converger tree. The ability of
the Converger to promote at all levels also contributes to
maintaining stream flow.

To analyze control memory requirements, estimates for the bit
lengths of function unit, SHIFT, MEMORY and BR microoperations
and immediate data are required. For our example, 3 bits are
sufficient to specify a register or to indicate that immediate
data follows for an operand. To specify a condition code, 2 bits
are allowed; a condition code (predicate) value takes 6 bits, a
target address displacement 6 bits, and a shift out destination
1 bit. Opcode field lengths are: function unit-5, Shift-4,
Memory-1. These bit requirements are summarized in table 3.6.

The maximum field length is 16 bits. For the adaptive
microprogrammable control unit, there are 5 types and they can be
differentiated by a 2 bit tag. Memory and Branch
microoperations, because their field lengths are less than 16
bits, can share one tag. Thus for fixed length microoperation
formats, 18 bits are required. The multiply algorithm has 21
microoperations and 9 constants, requiring a total of 540 bits.

For a conventional microprogrammable processor,
microoperations do not require a condition code field because
condition code setting information is conveyed by the context
preceding a branch microoperation. Similarly, Dest2 is not
required for shift microoperations. Thus for a maximal
horizontal microprogrammable processor, the following
microinstruction fields are needed:

2 function unit @ 14 bits = 28
2 Shift         @ 13 bits = 26
1 Memory        @  7 bits =  7
1 Branch        @ 12 bits = 12
1 Immediate     @ 16 bits = 16

                    TOTAL = 89

Using this microinstruction format, the multiply algorithm
requires some 9x89=801 bits.
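
The two control memory totals can be reproduced with the sketch below. The field widths are those listed above, with the immediate field taken as 16 bits because that is what the 89 bit total implies; the names are illustrative.

```python
# Adaptive processor: fixed-length microoperations (16-bit field + 2-bit tag).
ADAPTIVE_WORD_BITS = 16 + 2
adaptive_bits = (21 + 9) * ADAPTIVE_WORD_BITS   # 21 microoperations, 9 constants

# Maximal horizontal microinstruction: field -> (count, bits per field).
HORIZONTAL_FIELDS = {
    "function unit": (2, 14),
    "shift": (2, 13),
    "memory": (1, 7),
    "branch": (1, 12),
    "immediate": (1, 16),
}
word_bits = sum(count * bits for count, bits in HORIZONTAL_FIELDS.values())
horizontal_bits = 9 * word_bits                 # 9 horizontal microinstructions

print(adaptive_bits, word_bits, horizontal_bits)
```

On these figures the sequential microoperation stream needs 540 bits against 801 bits for the maximal horizontal format.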


A maximal microprogram control unit organization is usually
wasteful of control memory. By combining only statistically
popular combinations of microoperations, control memory savings
will result. The following set of microinstruction formats
covers the example microprogram with the fewest
microinstructions:


1  . 

2  . 

3. 

4  . 

5.

control m

Summa
processor
designed
maximal m
the horizontal microprogrammable control unit, the adaptive
processor Control Memory address space has an additional two bit
requirement.

3.8  Generating Microcode for an Adaptive Processor

We briefly describe the steps necessary to produce a
microprogram to be executed on an adaptive processor. The
microprogram can be generated either from the output of a
compiler or from an existing microprogram for an operationally
similar microprogrammable processor.

If the microprogram comes from a compiler output, we assume
that control structure information is available for
microoperation block identification, and that the operands have
not yet been assigned to work registers. Useful control
structure information would include links to the immediate
predecessor and successor blocks.

The basic problem is to produce a microoperation ordering
whose non-preemptive list schedule will result in an adaptive
processor execution time that is consistently less than the
execution time on an operationally similar maximal
microprogrammable processor. List schedules are ordered sets of
tasks where the ordering conforms to some priority assignment on
the tasks. A task is initiated according to the list if there is
a processor available and if its data constraints are satisfied.
List schedules have been extensively studied in the literature
for the case of homogeneous processors [GRA72, ACD74, GON77]. In
general, list schedules do not produce optimal schedules that
reduce the maximum completion time. In addition, they are subject
to counterintuitive behaviour where the maximum completion time
can increase if the execution times of tasks are decreased or if
the number of processors is increased [GRA72]. Graham has derived
a general bound on the ratio of list schedule times for a task
system (T, <):

\[ \frac{w}{w_0} \le 1 + \frac{n-1}{n} \]

where $w_0$ = completion time for the best list schedule on n
              processors
      $w$   = completion time for an arbitrary list schedule
              on n processors
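
A minimal sketch of non-preemptive list scheduling on identical processors may make the definition concrete. The function name, the task set and the two orderings are hypothetical examples, not taken from the cited studies; the sketch assumes an acyclic precedence relation.

```python
def list_schedule(order, duration, preds, n_processors):
    """Completion time of a non-preemptive list schedule.

    order:    task names in priority order (the list)
    duration: task -> execution time
    preds:    task -> set of tasks that must finish first
    """
    finish = {}                    # task -> finish time, once started
    free_at = [0] * n_processors   # time at which each processor falls idle
    pending = list(order)
    time = 0
    while pending:
        started = False
        for task in pending:
            ready = all(finish.get(p, float("inf")) <= time
                        for p in preds.get(task, ()))
            idle = [i for i, t in enumerate(free_at) if t <= time]
            if ready and idle:
                free_at[idle[0]] = finish[task] = time + duration[task]
                pending.remove(task)
                started = True
                break              # rescan the list from the top
        if not started:
            # Nothing can start now: advance to the next task completion.
            time = min(t for t in free_at if t > time)
    return max(finish.values())


duration = {"A": 3, "B": 2, "C": 2}
good = list_schedule(["A", "B", "C"], duration, {}, 2)
poor = list_schedule(["B", "C", "A"], duration, {}, 2)
print(good, poor, poor / good <= 1 + (2 - 1) / 2)
```

With two processors the two orderings finish at times 4 and 5, a ratio of 1.25, which lies inside the bound of 1.5 for n = 2.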

On an empirical basis, Adam, Chandy, and Dickson [ACD74]
compare the performance of several heuristics used to obtain list
schedules with the computed lower bound derived by Fernandez and
Bussell [FB73]. An extensive set of task systems was analyzed.
The heuristics compared are:

HLFET :  Highest levels first with estimated times
HLFNET:  Highest levels first with no estimated times
RANDOM:  Task priorities are randomly assigned
SCFET :  Smallest colevels first with estimated times
SCFNET:  Smallest colevels first with no estimated times.

In the definitions above, a level for a task T is defined as the
minimum time required to complete the execution of the set of
tasks after T is initiated. Thus the HLFET list would order
tasks according to the longest remaining path. HLFNET is similar
to HLFET except that all tasks are assumed to have unit execution
times. A colevel of a task T is the minimum time by which the
execution of T can be completed after the execution of the task
system is initiated. The results of Adam et al [ACD74] produced
the following order among heuristics, starting with the best:
HLFET, HLFNET, SCFNET, RANDOM, and SCFET. HLFET was
significantly superior to the others, producing near-optimal
schedules (within 5% of the lower bound completion time) in all
but 1 of over 900 trials. The results also indicate that the
worst case task systems and schedules realizing Graham's bound
would have low incidence in practice.
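
The level computation behind the HLFET ordering can be sketched as follows. The sketch takes the level of a task to be its own execution time plus the longest chain of successors, which is one reading of "longest remaining path"; the task graph and names are hypothetical.

```python
from functools import lru_cache


def hlfet_order(duration, succs):
    """Return tasks in decreasing level order, together with the levels."""

    @lru_cache(maxsize=None)
    def level(task):
        # Own execution time plus the longest path through any successor.
        return duration[task] + max(
            (level(s) for s in succs.get(task, ())), default=0
        )

    levels = {task: level(task) for task in duration}
    # sorted() is stable, so ties keep the original task order.
    return sorted(duration, key=lambda t: -levels[t]), levels


duration = {"A": 2, "B": 3, "C": 1, "D": 4}
succs = {"A": ["C"], "B": ["C"]}
order, levels = hlfet_order(duration, succs)
print(order, levels)
```

Passing a duration table in which every task has execution time 1 gives the HLFNET ordering instead.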

The above results are not directly applicable to our
situation. Our case is a generalization in that the adaptive
microprogrammable control unit controls sets of specialized
processors or function units and microprograms have more general
precedence constraints than the directed, acyclic precedence used
in the above references. In assigning ordering priorities to
microoperations, the utility of other heuristics should be
examined with research similar to that of Adam et al [ACD74].
For example, the computation in figure 3.13a uses function units
of type A and B.

Figure 3.13a  Computation using two specialized Function Unit Classes

Figure 3.13b  Heuristics and their lists

Figure 3.13c  Heuristics and their processor assignments.

In figure 3.13a, the nodes are labelled by the function unit
type and an integer to identify the microoperation. Outside the
nodes, the first integer corresponds to the execution time of the
microoperation, the first square bracket contains the level of
the microoperation and the second square bracket is the function
unit usage vector. Because of space considerations, these have
not been included for nodes A2-A4. The information is identical
to that under node A1. Each component of the function unit usage
vector corresponds to a function unit class, f, and specifies
the average number of time units required for execution by the
corresponding microoperation and its successors on an f-function
unit. The function unit usage vector is used by a heuristic
called Most Utilized Resource (MUR). MUR selects as the next
microoperation the one with the highest component for placement
in the list schedule. The generated list schedules are shown in
figure 3.13b and the function unit assignments by the adaptive
processor are shown in figure 3.13c. This example shows that in
the case with specialized processors, a processor class may be a
bottleneck and that this class will determine the minimum
completion time.
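
The usage-vector heuristic described above can be sketched as follows. Figure 3.13 is not reproduced here, so the task graph is hypothetical, and the reading of "average" as total demand divided by the number of units in the class is an assumption of the sketch rather than the definition used in the figure.

```python
def usage_vectors(duration, unit_class, succs, units_per_class):
    """Per task: class -> time demanded by the task and all its successors,
    divided by the number of units of that class (assumed meaning of 'average')."""

    def descendants(task, seen):
        for s in succs.get(task, ()):
            if s not in seen:
                seen.add(s)
                descendants(s, seen)
        return seen

    vectors = {}
    for task in duration:
        vec = {c: 0.0 for c in units_per_class}
        for member in descendants(task, {task}):
            cls = unit_class[member]
            vec[cls] += duration[member] / units_per_class[cls]
        vectors[task] = vec
    return vectors


def usage_vector_order(duration, unit_class, succs, units_per_class):
    """List order: highest usage-vector component first (ties keep input order)."""
    vectors = usage_vectors(duration, unit_class, succs, units_per_class)
    return sorted(duration, key=lambda t: -max(vectors[t].values()))


duration = {"A1": 1, "A2": 2, "B1": 3, "B2": 3}
unit_class = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
succs = {"A1": ["B1"], "B1": ["B2"]}
print(usage_vector_order(duration, unit_class, succs, {"A": 1, "B": 1}))
```

In this small case A1 is placed first because it heads six time units of work for the single B unit, the class that limits the completion time.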

Another aspect that deserves consideration is the effect of
conditional branches. If the condition code microoperation, m,
of a block is percolated upstream, the set of microoperations
that follow m in the block will be called the block tail.
Similarly, the set that precedes m will be called the block
head. Consider the case where m does not percolate beyond
the block. Clearly, the microoperations in the block tail will
affect the schedules of the microoperations in the heads of the
immediate successor blocks. Thus the domain of microoperations
to be used in generating a list schedule for a block can extend
beyond the block boundary.

For  example,  consider  the  HLFET  heuristic  and  suppose  that  a 
block  A  has  block  B  as  a  successor.  The  HLFET  heuristic  applied 
to  B  should  also  consider  the  ordering  of  the  microoperations  of 
the  tail  of  A  to  maximize  its  effectiveness.  The  reason  the  A 
tail  should  be  considered  is  that  B  will  be  activated  after  the 
condition code microoperation of A is executed with some of the A
tail  microoperations  remaining  unscheduled. 

Two  difficulties  arise  in  any  attempt  to  utilize  block  tail 
orderings. Both stem from indeterminacies - the uncertainty of
the  block  tail  constituents  at  execution  time  and  the  uncertainty 
of  the  successor  block.  The  first  uncertainty  arises  because  one 
cannot  guarantee  that  microoperations  in  the  tail  will  not  be 
executed  before  the  predicate  is  resolved.  The  second,  and 
potentially  more  important,  occurs  because  the  tail  positioning 
may  be  optimizable  for  only  one  of  several  successors.  In  any 
case,  simulation  experiments  could  be  performed  to  determine  the 
utility  of  examining  interblock  microoperations  when  generating 
the  microoperation  ordering. 

When generating a list, one will usually have the
microoperation execution times available, allowing the use of a
superior heuristic, such as HLFET. This schedule will then
reflect this set of times. If in the operation unit a function
unit is later replaced with a faster one, the original schedule
will not accurately reflect the change. The question then arises
whether a regeneration of the list is warranted. Again,
simulation experiments will provide an answer. Some idea is
provided by the study of Adam et al [ACD74] where the HLFET and
HLFNET heuristics were compared. In the majority of cases, the
effectiveness of HLFNET was within 5% of HLFET. This would
indicate that the list schedules are reasonably robust to changes
in execution times. Robustness of lists to the number of
function units is another study area.

It  can  be  appreciated  that  adaptive  processor  performance  is 
enhanced  by  the  percolation  of  the  condition  code  microoperation. 
Some  questions  that  naturally  arise  on  this  topic  will  be 
addressed. 

The least upper bound on the percolation displacement for
maximum effectiveness is given approximately by the expression

\[ d = SR + TP \cdot IER \]

where

SR  = scan rate of the Issuer
TP  = average time between the condition code
      microoperation initiation and predicate resolution
IER = average execution rate of microoperations.

If the condition code microoperation is percolated beyond d
microoperations, the adaptive processor will not be able to
exploit microoperations in the successor block because they will
be beyond the range of the Issuer's scan. If the condition code
microoperation is not percolated d microoperations, there may be
exploitable microoperations that are not activated in time.
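
As a small worked instance of the expression for d, the sketch below evaluates it for one set of numbers. The values are hypothetical and chosen only for illustration: an Issuer that scans 10 microoperations per cycle, a predicate that resolves 2 cycles after the condition code microoperation is initiated, and an average execution rate of 1.5 microoperations per cycle.

```python
def max_useful_displacement(scan_rate, predicate_delay, execution_rate):
    """d = SR + TP * IER, expressed in microoperations."""
    return scan_rate + predicate_delay * execution_rate


# Hypothetical parameter values, not measurements from the text.
print(max_useful_displacement(scan_rate=10, predicate_delay=2, execution_rate=1.5))
```

On these assumed values, percolating the condition code microoperation more than about 13 microoperations upstream would bring no further benefit.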

Of more practical significance is the expected displacement
in actual programs. Flynn [FLY72] reports on a study of a 5
program scientific problem mix to determine the frequency of
conditional branch instructions and on the displacement of
condition code setting instructions from conditional branches.
The study also tried to eliminate as many branches as possible.
The study achieved a conditional branch incidence ranging from
2-10% and observed an average displacement less than three
instructions. Flynn did not report how much effort was expended
to percolate the condition code instruction. For example, was it
stopped at the first constraining instruction or was there any
effort in percolating this instruction as well?

We provide a simplified analysis for some quantitative
insight. Assuming that the dependency of a condition code
microoperation on the i'th downstream microoperation is
geometrically distributed with mean 1/p, then:

\[
P[\text{condition code microoperation has a dependency on the } i\text{'th microoperation}] = p\,(1-p)^{i-1}
\]

Thus the probability that the condition code microoperation
can percolate at least i microoperations is

\[
P[\text{displacement} \ge i] = \sum_{j=i+1}^{\infty} p\,(1-p)^{j-1} = (1-p)^{i}
\]

Table 3.7 tabulates the probabilities of various minimum
displacements against mean displacements.

    Minimum                  Mean Displacement
    Displacement       2      3      4      5      6

         2            .44    .56    .64    .69    .73
         4            .20    .32    .41    .48    .54
         6            .09    .18    .26    .33    .40
         8            .04    .10    .17    .23    .30
        10            .02    .06    .11    .16    .21


Table  3.7  Percolation  Displacement  Probabilities 
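
The entries of Table 3.7 can be recomputed from the expression (1-p)^i. The sketch below assumes that the mean displacement equals (1-p)/p, that is, one less than the mean dependency position 1/p; this reading is inferred from the tabulated values and is not stated in the text. The function name is hypothetical. Under that assumption the computed values agree with the tabulated ones to within 0.01.

```python
def percolation_probability(min_displacement, mean_displacement):
    """P[displacement >= i] = (1 - p)**i for a geometric dependency.

    Assumption: mean displacement = (1 - p) / p, so 1 - p = mean / (mean + 1).
    """
    q = mean_displacement / (mean_displacement + 1.0)
    return q ** min_displacement


# Rows are minimum displacements, columns are mean displacements.
for i in (2, 4, 6, 8, 10):
    row = [percolation_probability(i, mean) for mean in (2, 3, 4, 5, 6)]
    print(f"{i:2d}  " + "  ".join(f"{value:.2f}" for value in row))
```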

Table  3.7  indicates  that  the  probability  that  a  condition 
code  microoperation  percolates  past  i  microoperations  quickly 
decreases  with  i.  It  also  indicates  that  modest  displacements 
are  possible  with  significant  probability.  Even  modest 
displacements  can  provide  some  opportunity  to  combine  some 
microoperations from the block tail and the successor head.

The  program  structures  that  are  synthesized  by  conditional 
branches  can  be  characterized  by  three  main  types  -  the  IF-THEN- 
ELSE,  the  INDEXED  LOOP,  and  the  DO  WHILE  LOOP.  Of  these,  the 
purely  INDEXED  LOOP  offers  the  most  opportunity  for  condition 
code  microoperation  percolation  because  the  loop  predicate  is 
virtually  independent  of  the  main  computation  in  the  loop  body. 
In  these  cases,  percolation  outside  the  loop  body  would  be 
logically  possible,  but  at  a  cost  of  increased  complexity  of  the 
Predicate  Evaluation. 

The DO WHILE LOOP, on the other hand, as exemplified by the
Newton-Raphson iteration, has its predicate dependent on the
outcome of the loop computation. In this case condition code
microoperation movement is relatively restricted. IF-THEN-ELSE
constructs do not have a characteristic that would either bind or
release the condition code microoperation. These cases may
provide a relatively large deviation in mobility.

When  generating  the  microoperation  ordering  for  the 
microprogram,  top  priority  should  be  assigned  to  the  condition 
code microoperation and its predecessors. Because the adaptive
processor will determine data dependencies from the register
specifications, a good starting point for microcode generation is
before register assignment. This facilitates the positioning of
the condition code microoperation and its predecessors as far
forward as data constraints will permit. This tactic will not, in
general, provide the best list schedule, but it will maximize the
number of microoperations examined by the Issuer. This heuristic
has strong intuitive appeal for the adaptive processor.
Afterward, the registers could be assigned and the remaining
microoperations positioned according to the HLFET, HLFNET, or MUR
heuristics or some combination. These heuristics could be
applied locally to blocks or they could be applied across block
boundaries into the tails of predecessor blocks.

If  one  has  existing  microcode  for  an  operationally  similar 
horizontal  microprogram  central  processing  unit,  one  can  proceed 
to  serialize  the  microoperations  in  the  microinstructions  into  a 
list.  First,  the  microoperation  formats  should  be  made  to 
conform  with  the  microoperation  format  of  the  adaptive  processor. 
Second,  care  must  be  taken  to  ensure  the  microoperations  are 
serialized into the correct order. If a microinstruction
contains microoperations m and n, with S(m) ∩ D(n) ≠ ∅, m should
precede n in the list. These are the necessary functions to
serialize horizontal microcode. Additional embellishments to
enhance performance can also be made. These have already been
outlined in this section.
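
The ordering rule for serialization can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular microassembler: the `MicroOp` record, the register names, and the `serialize` function are hypothetical, S(m) is taken to be the set of source registers of m and D(n) the set of destination registers of n, and format conversion is assumed to have been done already.

```python
from collections import namedtuple

# Hypothetical record: a microoperation with its source and destination registers.
MicroOp = namedtuple("MicroOp", ["name", "sources", "dests"])


def serialize(microinstructions):
    """Flatten horizontal microinstructions into one ordered list.

    Within each microinstruction, m is placed before n whenever the
    sources of m intersect the destinations of n.
    """
    stream = []
    for word in microinstructions:
        remaining = list(word)
        while remaining:
            # Choose a microoperation whose destinations are not read by
            # any other microoperation still waiting to be emitted.
            for n in remaining:
                readers = [m for m in remaining
                           if m is not n and set(m.sources) & set(n.dests)]
                if not readers:
                    break
            else:
                raise ValueError("cyclic ordering constraint within one microinstruction")
            stream.append(n)
            remaining.remove(n)
    return stream


# Hypothetical microinstruction: n writes R2 while m reads the old R2.
word = [MicroOp("n", ("R1",), ("R2",)), MicroOp("m", ("R2",), ("R3",))]
print([op.name for op in serialize([word])])
```

A microinstruction whose microoperations exchange registers produces a cyclic constraint; the sketch reports this case rather than guessing at a resolution.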


3.9  Summary of the Properties of the Adaptive Processor


The  adaptive  processor  has  several  advantages  and 
disadvantages  over  the  classical  microprogrammable  control  unit. 
These stem from the adaptive processor's capability to
dynamically compose microinstructions from a sequential stream of
microoperations,  and  to  do  so  across  block  boundaries.  The 
following  is  a  list  of  these: 


Performance  Enhancement 

The  adaptive  processor  can  initiate  microoperations 
independently  of  others.  They  are  not  fixed  to  any 
microinstruction,  they  can  be  initiated  despite  preceding 
microoperations  being  blocked,  and  they  can  be  initiated  from 
a  conditional  block  once  the  predicate  is  evaluated. 
Furthermore,  the  adaptive  processor  can  overlap  the  fetch  and 
decode  phases  of  machine  instructions  (or  microprograms)  with 
the  execution  of  preceding  machine  instructions.  Because 
execution  of  microoperations  is  decoupled  from  control  memory 
fetching,  execution  cycles  are  not  lost  when  a  branch  out  of 
sequence  occurs.  A  classical  microprogrammable  control  unit 
does not have these capabilities.

Superior performance is not always guaranteed for the
following reasons:

1.  List  schedules  do  not  guarantee  minimum  completion 
time  and  they  are  subject  to  anomalous  behaviour. 

2.  In  program  sequences  having  a  high  incidence  of 
conditional  branches,  the  Converger  may  not  be  able 
to  maintain  an  adequate  microoperation  stream  rate. 

3.  The  Issuer  scan  rate  may  be  less  than  the  possible 
parallelism  of  the  input  stream. 

Simplified  Microcode  Generation 

Generating  a  list  of  microoperations  that  preserves  the 
intended  precedence  of  the  computation  is  straightforward. 
Such lists, providing a RANDOM ordering, have maximum
completion times within 10% of lists generated by the HLFET
heuristic [ACD74] for the homogeneous processor case. These
results could be improved by maximizing the condition code
microoperation displacement and by using a better heuristic.
On  the  other  hand,  when  generating  microcode  for  a 
conventional  microprogrammable  control  unit,  function  units 
must  be  explicitly  allocated  and  microoperations  must  be 
packed into microinstructions according to format
restrictions. 

Greater  Transportability  of  Microcode 

Adaptive  processor  microcode  should  be  transportable  between 
operationally  similar  processors  provided  the  conditioning 
schemes  are  compatible.  The  execution  of  the  code  is 
independent  of  the  number  and  speed  of  the  function  units  and 
is  independent  of  Control  Memory  speed.  This  is  not  true  for 
microcode  on  a  horizontal  microprogrammable  control  unit. 

Increased  Modularizability 

The  adaptive  processor  allows  the  operation  unit  to  be 
organized  in  a  modular  fashion,  allowing  variable  numbers  of 
function  unit  types  as  well  as  different  speeds.  This 
feature  requires  modification  to  the  machine  state  status  and 
update  hardware.  By  providing  asynchronous  interlocks 
between  Control  Memory,  Cache,  Converger,  and  Issuer,  the 
performance  parameters  of  these  system  blocks  can  be  set 
independently  to  provide  a  desired  performance  level. 
Conventional  microprogrammable  control  units  would  require 
extensive  hardware  modification  to  accommodate  changes  in 
function  unit  number.  Because  in  most  cases  function  unit 
speeds  will  also  be  synchronized  with  Control  Memory  timing, 
changing  function  unit  speeds  may  be  restricted. 

The greatest disadvantage of the adaptive processor is the
cost and complexity of the controls needed to support adaptive
microprogrammable control unit operation. A Cache, Converger,
Fetch Control, and Predicate Evaluator are essential to maintain
acceptable stream flow. A method that maintains the operating
state and communicates it to the Issuer and Control Buffer is
necessary for the hazard free initiation of microoperations.
These controls are not needed to such exacting degrees on
conventional microprogrammable control units because the hazard
avoidance function is assured during microcode generation time by
the precise positioning of microoperations into microinstructions
and by the setting of status bits in the microinstruction.
However, because of continual advancements in semiconductor
technology, the problems of cost and complexity of logic are
diminishing in importance.


3.10  Further Research and Adaptive Processor Extensions

We  summarize  areas  related  to  the  adaptive  processor  that 
require  further  research  and  briefly  discuss  some  extensions  of 
the  adaptive  processor. 

A  fundamental  performance  limitation  of  the  adaptive 
microprogrammable  control  unit  is  the  scan  rate  of  the  Issuer. 
Increasing  the  scan  rate  here  meets  with  difficulty  because  of 
the  sequential  checking  of  microoperations  for  hazard  avoidance. 
Other  points  in  the  stream  flow  do  not  have  this  degree  of 
difficulty  because  of  their  essentially  independent  actions.  In 
these  cases  flow  rate  can  be  increased  by  paralleling  the  data 
paths. Methods that allow paralleling of Issuer functions should
be examined to see if any simplification is possible. The
descriptions of the Issuer and Control Buffer indicated that the
simplest Issue conditions occur for schemes based on the virtual
function unit control buffer and operand forwarding.
Consequently, research that examines such schemes may produce
organizations with higher scan rates. Also, parallel issuing,
once considered prohibitive in circuit cost [AST67,FLY72], should
be reexamined in the light of current and near future
capabilities of LSI.




Advanced Conditioning techniques promise to shave cycles from
computation times by generating the predicate in parallel with
the condition code microoperation execution. Function unit
designs should be examined that include parallel logic for
predicate resolution.

Research on ordering heuristics is required to evaluate their
effectiveness and robustness in producing list schedules.
Simulation experiments driven by traces of representative
benchmark microprograms as well as randomly generated
microprograms would provide such information. Simulation
experiments would also illuminate the utility of various adaptive
processor features as well as providing component parameter
values for desired performance. This latter aspect is examined
analytically in the next chapter.

CHAPTER 4

Central  Processing  Unit  Performance  Modeling 

This  chapter  describes  the  performance  modeling  of  a  class  of 
computers  called  basic  processors.  Their  input  requirements,  the 
machine  configuration  and  program  structure  are  discussed. 

A  section  is  devoted  to  describing  the  features  of  Bowra  and 
Torng's  Issue  Delay  model.  This  model  derives  performance 
results  by  obtaining  the  mean  time  between  successive  instruction 
issues.  The  section  closes  with  a  modified  model  that  includes 
the  effect  of  time  slippage  between  successive  instruction 
issues. 

The  central  part  of  the  chapter  describes  the  framework  of  a 
much  more  general  modeling  system.  The  system  closely  reflects 
the  computation  process  as  implemented  on  low  level,  parallel 
hardware.  The  model  focuses  on  two  aspects  of  this  computation 
process,  the  supplying  of  microoperations  to  an  Issuer  and  the 
issue of microoperations for execution. These are structured
according  to  the  control  policy  and  the  underlying  hardware. 

The  supply  process,  in  the  context  of  prefetch  mechanisms 
such  as  the  Converger,  is  captured  by  the  Window  model.  The 
Window  model  quantifies  the  effects  of  stream  control 
microoperations  on  the  stream  controller  function. 

The Scan Space model differs significantly from the Issue
Delay model. It determines a probability distribution for the
number of microoperations issued in a cycle, as opposed to the
average time between issues. It models a scan over a sequence of
Active microoperations. This allows us to capture simultaneous
issues in a cycle, as for example, on horizontal
microprogrammable control units.

The  use  of  the  modeling  system  is  illustrated  by  deriving  the 
equations  for  some  of  the  basic  processor  organizations  described 
in  chapter  3.  A  detailed  analysis  is  given  for  the  maximal 
horizontal  organization.  The  remaining  processors  are  then 
modeled  using  the  maximal  machine  as  a  starting  point. 


4.1  Basic Processor Specification

Recall that the basic processor as defined in section 2.7 has
operations  that  operate  only  on  work  register  contents,  an  output 
bus  for  each  function  unit,  and  all  function  units  can  access  all 
registers.  As  such,  the  basic  processor  has  both  restrictions 
and  extensions  when  compared  to  any  real  machine. 

A  basic  processor  class  is  specified  jointly  by  a  control 
policy  and  a  machine  configuration.  The  nature  of  control 
policies  has  been  examined  in  chapter  3.  Control  policies 
describe  the  dynamics  of  the  microoperation  stream  as  it  passes 
through  the  various  stations  of  the  processors.  In  this  chapter, 
we  will  formulate  control  policies  for  various  basic  processors, 
and  use  them  to  derive  performance  models.  The  machine 
configuration  provides  a  quantitative  hardware  specification  of 
the processor. It will be described in the next section.

Figure 4.1  Basic Processor Model

4.1.1  The Machine Configuration

The  machine  configuration  we  will  use  is  an  expansion  of  the 
concept  described  by  Bowra  and  Torng  [BT76].  In  addition  to 
specifying  the  operational  unit  characteristics,  we  will  include 
control  unit  characteristics. 

The  operational  unit  characteristics  are  specified  by  the 
number  of  work  registers,  and  by  a  function  unit  vector  whose 
components  specify  the  function  unit  type  and  number.  A 
microoperation  -  function  unit  table  specifies  the  function  unit 
and execution time for each microoperation. We will assume that
a microoperation for a given configuration is executed by one
function unit type only. We also assume that all data path widths
in the operation unit are the same.

Memories are treated as function units when a microoperation
requests a data transfer. Memories that also supply the
microoperation stream or memories that are shared with external
processes, such as I/O, require additional treatment. For
certain classes of memory wait, an average execution time that
includes any waiting time in memory queues will be assumed. This
average execution time can include the effects of a cache, in
which case the effective access time is used, or it can include
the interruption of a dynamic refresh cycle. Long I/O waits will
not be included.

Work registers provide a supply of operands that can be
instantly accessed by the function units. Performance is
enhanced by a large number of work registers because the number
of accesses for temporaries to slower memory is reduced. Except
for some highly specialized registers such as a program counter,
we will assume that all work registers are universally accessible
by the function units.

The control unit specification includes the Control Memory,
Cache, Converger, and Issuer parameters, the number and
characteristics of the control registers of the Control Buffer,
the type of Conditioning scheme used and the number of CC
registers in the Predicate Evaluator. In this chapter, we shall
use the acronym 'CC' in place of condition code.

The Control Memory, Cache, Converger, and Issuer jointly
determine the stream flow rate to the operational unit. The
important characteristics that directly affect performance are
the Window capacity and the effective Issuer scan rate. The
Window consists of those Converger arms that can directly
transmit microoperations to the Issuer. Maximum Window capacity,
W_M, the number of microoperation buffers in the Window,
determines how readily microoperations can be transmitted to the
Issuer. The larger the capacity, the more continuous the stream
flow to the Issuer can be. The Window stream flow is regulated
by the provision policy of the control policy and is affected by
the incidence of branch microoperations. Capacity dynamics are
analyzed in a later section.

The Issuer scan rate determines the number of microoperations
that can be examined each cycle for issue to the Control Buffer.
It is a function of the Issuer's hardware organization. It is an
important parameter that directly affects the IER.

The number of Control Registers determines the number of
microoperations that can be issued concurrently to the
operational unit for execution. Each Control Register is
dedicated to a function unit type and is used to buffer a
microoperation. In a microprogrammable control unit, these
correspond to microoperation fields in the microinstruction
register. In these cases, performance may be restricted by the
set of available microinstruction formats. Control Registers may
also have provision for buffering source values. This simplifies
the issue policy and enhances performance by reducing
microoperation data interdependencies [TOM67,KEL75].

The  Predicate  Evaluator  characteristics  will  not  be  used 
directly  in  the  performance  analysis  models.  We  will  assume  that 
the  Predicate  Evaluator  is  capable  of  resolving  predicates  when 
the  need  arises.  Our  assumptions  on  CC  microoperation  motion  are 
described  in  section  4.4  on  Window  analysis. 

The following definitions summarize the basis of comparison
between basic processor models. Two basic processors P_i and P_j
are said to be operationally similar if the set of
microoperations executed by P_i is identical to the set of
microoperations executed by P_j, and they have the same number of
work registers. A comparison of operationally similar processors
is very general, allowing the variation of any processor
parameters. Two processors P_i and P_j are operationally
equivalent if

1. they are operationally similar,

2. their function unit sets are identical, and

3. they have the same memory types with the same
address space.

Operationally  equivalent  processors  are  intended  to  have 
identical  operation  unit  characteristics.  They  can  be  used  to 
compare  the  performance  aspects  of  control  unit  parameters, 
allowing  a  comparative  evaluation  of  control  policies. 

In an analogous manner, control similar and control
equivalent processors may be defined by specifying restrictions
on control unit parameters. Two processors P_i and P_j are control
similar if:

1. they are regulated by the same Control Policy and

2. they use the same Conditioning scheme.

Two processors are control equivalent if

1. they are control similar and

2. they have identical control unit parameters.
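
The four comparison relations defined above can be expressed as simple predicates. The sketch below is only an illustration: the `BasicProcessor` record and all of its field names are assumptions introduced here, and each field stands in for the corresponding part of the machine configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicProcessor:
    # All field names are illustrative assumptions.
    microops: frozenset        # set of microoperations (op-codes) executed
    work_registers: int        # number of work registers
    function_units: tuple      # (function unit type, count) pairs
    memories: tuple            # (memory type, address space) pairs
    control_policy: str        # name of the Control Policy
    conditioning: str          # name of the Conditioning scheme
    control_params: tuple      # remaining control unit parameters


def operationally_similar(a, b):
    return a.microops == b.microops and a.work_registers == b.work_registers


def operationally_equivalent(a, b):
    return (operationally_similar(a, b)
            and a.function_units == b.function_units
            and a.memories == b.memories)


def control_similar(a, b):
    return (a.control_policy == b.control_policy
            and a.conditioning == b.conditioning)


def control_equivalent(a, b):
    return control_similar(a, b) and a.control_params == b.control_params
```

Two processors that satisfy `operationally_equivalent` but not `control_equivalent` isolate the effect of the control unit, which is the comparison the performance models are meant to support.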

In  the  performance  analysis  models  we  will  assume  that 
microprograms  are  transportable  between  operationally  similar 
models.  We  have  previously  pointed  out  that  different  types  of 
Conditioning  schemes  will  detract  from  microprogram 
transportability. The effect of Conditioning schemes will be
included in the effects of the provision policies on Window
capacity.

4.1.2  The Program Structure

The processor class definition provides sufficient
information about the machine for generating an analytical
performance model. Correspondingly, the model inputs should also
reflect important features of a program. Towards this end, the
Program Structure defined by the quintuple {U, C, O, L, I} will be
used, where

U - Microoperation Occurrence Probabilities
    {u_i | u_i = P[the microoperation with the i'th op-code
    is executed]}

C - Predicate Dependency Probabilities
    {c_i | c_i = P[a CC microoperation is located i
    microoperations in front of the branch]}

O - Operand Dependency Probabilities
    {o_i | o_i = P[a non-branch microoperation m
    will reference the result of a microoperation
    i positions downstream]}

L - Register Value Lifetime Probabilities
    {l_i | l_i = P[the register value created by a microoperation
    will last be referenced by a microoperation i positions
    downstream]}

I - Average Issue Interval
    The average number of microoperations scanned per microoperation
    issued given that multiple issues occurred in a cycle.

Standard set and probability notation is used in the above
definitions:

    {a | p} is the set of elements a having property p
    P[A] is the probability that event A occurs.

It  is  assumed  that  the  above  distributions  are  time 
independent  and  reflect  dynamic  occurrences  in  a  microoperation 
stream  as  opposed  to  static  occurrences  in  a  microprogram. 


The first four parameters are basically the same as those
used by Bowra and Torng [BT76] with some minor modifications that
will be explained later. The average issue interval I will be
used to indicate the degree of difficulty in parallelism
detection that the Issuer must overcome.

These parameters can be empirically obtained by collecting
statistics during the execution of a program on some data. These
statistics will be highly dependent on the algorithm that the
program describes, on the method that was used to generate the
code, and on the operations supported by the machine. The
latter, a machine dependency, restricts direct performance
comparisons to operationally similar machines.
Consequently, we neglect such important parameters as the bit
length of microoperations and data, which will be identical for
all models.

4.1.3  Predicate Dependency Probabilities

This  distribution  gives  a  measure  of  the  forward  mobility 
possible  by  conditional  branches  in  a  program.  Conceptually,  it 
is  identical  to  the  operand  dependency  distribution  described  in 
section 4.1.5. Because of the pronounced effect
[FLY66,AST67,FLY72] branch microoperations have on stream flow,
predicate dependency is treated separately from operand
dependencies.

4.1.4  Microoperation Occurrence Probabilities

The dynamic frequency of occurrence for a given op-code of a
microoperation can be empirically obtained for a given program by
determining the number of times it occurs in an execution trace
of the program. We will assume that the occurrence of a
microoperation is independent of any previous occurrence of a
microoperation, i.e. P[op-code m occurs | op-code n occurred j
positions earlier] = P[op-code m occurs]. This assumption is
traditional in analytical models [SG69,FLY72,TS74,FUL75,BT76].
Experience, however, indicates that this is not the case
[FLY72,ALE72,PS77].
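
The empirical estimate described above amounts to a relative frequency count over an execution trace. The following is a minimal sketch; the function name, the trace encoding (a list of op-code strings), and the sample op-codes are hypothetical.

```python
from collections import Counter


def occurrence_probabilities(trace):
    """Estimate u_i as the relative frequency of each op-code in a dynamic trace."""
    counts = Counter(trace)
    total = len(trace)
    return {op: count / total for op, count in counts.items()}


# Hypothetical four-entry execution trace.
trace = ["ADD", "LOAD", "ADD", "BRANCH"]
u = occurrence_probabilities(trace)
print(u)
```

Because the count is taken over the whole trace without regard to position, it embodies the independence assumption stated above and cannot reveal clustering of op-codes.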

There are at least two types of workload where the assumption
of independence could affect the IER. These are the clustering
of microoperations requesting a single function unit type and the
clustering of microoperations with long execution times. Both
types would increase the waiting time of issuing microoperations.
Neglecting either of these effects will tend to inflate the IER
estimate.

4.1.5  Operand Dependencies and the Average Issue Interval

The operand dependency distribution is necessary to provide
some measure of the sequential binding between operations, or
conversely, the degree of operation parallelism in the program.
It is used to determine the probability that a microoperation
must wait for a source operand. Bowra and Torng [BT76] use a
time independent distribution, r_j, that specifies the probability
that an arithmetic or Boolean microoperation requires the result
or result register of the j'th previous instruction. Inclusion
of the result register forces too broad an interpretation on o_j.
Some of the Control Policies we examine require a finer level of
information. Result register considerations will be treated
separately with the register value lifetime distribution in
section 4.7. Two interpretations of o_j and the methods used to
obtain the respective statistics will be discussed.

The straightforward interpretation suggests the following
statistics gathering procedure:

Examine an execution trace of a program and its data, and
take statistics for the distances (in # of microoperations)
between a source specification and its creating
microoperation. In the case of immediate operands, increment
the statistic for distance ∞.
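
The statistics gathering procedure above can be sketched in a few lines. The trace encoding is an assumption made for this illustration: each entry is a (destination register, source registers) pair, and `None` as a source marks an immediate operand. Sources whose creating microoperation lies outside the trace are also counted at distance ∞, which is a simplification introduced here.

```python
import math
from collections import Counter


def operand_distance_counts(trace):
    """Count distances between each source reference and its creating microoperation.

    trace: list of (dest, sources) pairs in execution order.
    """
    last_writer = {}           # register -> position of its most recent writer
    counts = Counter()
    for position, (dest, sources) in enumerate(trace):
        for src in sources:
            if src is None or src not in last_writer:
                counts[math.inf] += 1          # immediate or unknown creator
            else:
                counts[position - last_writer[src]] += 1
        if dest is not None:
            last_writer[dest] = position
    return counts


# Hypothetical three-microoperation trace.
trace = [("R1", (None,)), ("R2", ("R1",)), ("R3", ("R1", "R2"))]
print(operand_distance_counts(trace))
```

Dividing each count by the total number of source references gives an empirical estimate of the distance distribution.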

One weakness of this dependency measure is that it is
deceptively non-unique in its representation of parallelism.
Programs with the same structure can have a different set of
statistics and programs with substantially different structures
can have the same statistics. This is illustrated by the program
structures in figure 4.2, along with their statistics. A block
of code is given in the form of a precedence graph. Operator
precedence is shown by the edge direction, an operand dependency
by the opposite direction. Stream order is given by the node
numbering.

Program A has the same statistics as program B although the
parallelism, by almost any other measure, in B is greater.
Structure C displays an instruction ordering that exposes the
potential parallelism of B. To exploit the parallelism in B, a
scan rate of 7 microoperations/cycle would be needed. Structure
C would require a scan rate of only 3 microoperations/cycle.
Thus although the statistics of B and C do not indicate the same
parallelism, they do indicate that the exploitation of the
parallelism would be more difficult for structure B.

A somewhat more complicated statistics gathering technique
which is more indicative of the potential parallelism in a
program is suggested by the precedence graph. First the
microoperations are partitioned into levels as described in
section 3.8 and in [ACD74]. This is illustrated in figure 4.3
for structure A. Observe that an operation at some level i
cannot be initiated until some operation at level i+1 has been
executed. However, operations sharing a given level can be
executed in any order once all of the operations in the adjacent
higher level have been executed. Using this property, an operand
referencing statistic that will indicate the parallelism
independent of the stream ordering can be obtained. Assuming
that operations in level i randomly reference operations in level
i+1, a mean referencing distance for microoperations at level i
can be defined to the 'middle' microoperation at level i+1.

Let n be the number of microoperations at level i+1 and
    m be the position, from the left, of a microoperation
      p in level i.

Then the distance from the microoperation p to its
nearest source creating microoperation is defined to be
m - 1 + ⌈n/2⌉. This measure is somewhat arbitrary but it
does provide the same dependency for programs whose precedence
graphs have the same structure. Figure 4.3 illustrates its
application to various program structures.
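
The level-based referencing distance can be computed directly once the level partition is known. The sketch below takes the partition as given and does not derive it from a precedence graph. It assumes that `levels[k + 1]` is the adjacent higher level of `levels[k]` and that the garbled term in the definition is the ceiling of n/2; the function name and the sample node labels are hypothetical.

```python
import math


def level_reference_distances(levels):
    """Return {microoperation: m - 1 + ceil(n / 2)} for every level that has
    an adjacent higher level.

    levels[k] lists the microoperations of one level from left to right;
    levels[k + 1] is taken to be the adjacent higher level.
    """
    distances = {}
    for lower, higher in zip(levels, levels[1:]):
        n = len(higher)                        # size of the higher level
        for m, op in enumerate(lower, start=1):
            distances[op] = m - 1 + math.ceil(n / 2)
    return distances


# Hypothetical partition: two operations that depend on a level of three.
print(level_reference_distances([["a", "b"], ["c", "d", "e"]]))
```

Microoperations in the highest level have no source level in the partition and therefore receive no distance.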

The  associated  probability  of  this  statistic  reflects  only 
the  probability  to  the  nearest  reference.  If  the  microoperations 
have  dyadic  operators,  a  transformation  can  be  used  to  generate 
the  defined  operand  reference  distribution,  {o^}. 
Let d_i be the distance between a microoperation and the microoperation that creates its i'th source value, i = 1 or 2. Thus for a dyadic microoperation, o_i = P[d_1 = i \lor d_2 = i]. The probability statistics gathered by the method outlined above are for a random variable

    X = \min\{d_1, d_2\}.

If d_1 and d_2 are independently and identically distributed (write d for either one), then

    P[X = i] = P[d_1 \ge i \land d_2 \ge i] - P[d_1 \ge i+1 \land d_2 \ge i+1]
             = P[d \ge i]^2 - P[d \ge i+1]^2
             = P[d = i] (P[d = i] + 2 P[d \ge i+1])
             = P[d = i] (P[d = i] + 2 - 2 P[d \le i])
             = P[d = i] (2 - P[d = i] - 2 P[d < i])          (4.1)

Equation (4.1) is a quadratic in P[d = i]. Its solution is

    P[d = i] = 1 - P[d < i] - \sqrt{(1 - P[d < i])^2 - P[X = i]}
             = 1 - \sum_{j=1}^{i-1} P[d = j] - \sqrt{(1 - \sum_{j=1}^{i-1} P[d = j])^2 - P[X = i]}          (4.2)

Equation (4.2) is a recurrence relation that can be solved, starting with

    P[d = 1] = 1 - \sqrt{1 - P[X = 1]}          (4.3)

Thus for a dyadic microoperation

    o_i = 1 - P[d \ne i]^2 = 1 - (1 - P[d = i])^2          (4.4a)

For monadic microoperations

    o_i = P[X = i]          (4.4b)
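
The recurrence (4.2) can be evaluated mechanically. The following is a minimal sketch in Python, assuming the measured nearest-reference probabilities P[X = i] are supplied as a list indexed from i = 1; the function names are illustrative and do not come from the original work. The forward function `min_distribution` is included only to check the inversion.

```python
import math

def source_distance_probs(px):
    """Recover P[d = i] from P[X = i], X = min(d1, d2), using recurrence (4.2).

    px[i-1] holds P[X = i]; the result holds P[d = i] at index i-1.
    """
    pd = []
    below = 0.0  # running sum of P[d = j] for j < i, i.e. P[d < i]
    for p_x in px:
        tail = 1.0 - below                   # P[d >= i]
        disc = max(tail * tail - p_x, 0.0)   # guard against rounding below zero
        p_d = tail - math.sqrt(disc)         # smaller root of the quadratic (4.1)
        pd.append(p_d)
        below += p_d
    return pd

def operand_reference_probs(pd):
    """o_i for a dyadic microoperation: P[d1 = i or d2 = i] with i.i.d. d1, d2."""
    return [1.0 - (1.0 - p) ** 2 for p in pd]

def min_distribution(pd):
    """Forward direction: P[X = i] = P[d >= i]^2 - P[d >= i+1]^2."""
    px = []
    tail = 1.0
    for p in pd:
        nxt = tail - p
        px.append(tail * tail - nxt * nxt)
        tail = nxt
    return px
```

For a check, a distribution such as P[d] = (0.5, 0.3, 0.2) can be pushed through `min_distribution` and then recovered with `source_distance_probs`, up to rounding.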

Above, we discussed issues on methods of representing the parallelism in a program graph. A related problem is that of parallelism detection within a microoperation stream. The program structures B and C in figure 4.3 have the same degree of parallelism, but the microoperation ordering of structure C allows easier detection. For structure C, a scan rate of 3 microoperations/cycle will detect the parallel microoperations 2, 3, 4. For structure B, the Issuer must have a scan rate of at least 7 to detect the parallel microoperations 2, 5 and 8. Examining the precedence levels generated in figure 4.2, it is seen that the scan rate required to detect all parallel microoperations at level i is given by the expression:

    (scan rate required for level i) = M_i + 1 - m_i

where

    M_i is the maximum microoperation number at level i
    m_i is the minimum microoperation number at level i
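
As a small illustration of this expression, the sketch below computes the required scan rate for each precedence level. It assumes each level is given as a list of the stream positions (microoperation numbers) of its members; the function name is hypothetical.

```python
def scan_rates(levels):
    """Scan rate needed to see every microoperation of each level at once.

    levels: list of non-empty lists of microoperation numbers, one per level.
    Returns M_i + 1 - m_i for each level i.
    """
    return [max(ops) + 1 - min(ops) for ops in levels]
```

Applied to the two cases quoted in the text, the level {2, 3, 4} of structure C needs a scan rate of 3 and the level {2, 5, 8} of structure B needs a scan rate of 7.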

The  scan  rate  ignores  the  microoperations  that  may  have 

already  been  issued  from  level  i.  Such  occurrences  will  happen 

in  the  course  of  Issuer  actions,  but  we  will  not  consider  these 

secondary  effects  in  our  performance  analyses.  We  define  an 

issue  interval,  I^ ,  to  reflect  the  degree  of  difficulty 

in  detecting  parallelism  in  a  program 


N 


(4.5) 


i=1

where  the  sum  is  over  all  levels,  i,  and  N  =  the  number  of 



microoperation  levels 


The  parameter  I^will  be  incorporated  into  the  Scan  State  analysis 
of  adaptive  processors. 
4.1.6 Register Value Lifetime Probabilities

A register value lifetime is defined as the number of microoperations in the stream between a microoperation that creates a register value and the last microoperation to reference that value. The distribution associates probabilities with register lifetimes. It will be used to determine the probability that a result register will be available for an initiating microoperation.

A register value lifetime [LON77] may be determined by scanning an execution trace of microoperations and maintaining two counters for each register, a running count and a last reference count. The running count is initialized when a register value is created. It is incremented by 1 for each following microoperation that is scanned. The last reference count stores the number of microoperations scanned up to the last reference of the register. It ultimately becomes the lifetime when a new value is created for the register. Statistics are maintained for lifetime values and are then converted into probabilities when the trace scan is completed.
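
The two-counter scan can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation: the trace format (a destination register or None, plus a tuple of source registers) is an assumption, a microoperation is assumed to read its sources before writing its destination, and values still live at the end of the trace are not counted because their lifetime is only fixed when the register is overwritten.

```python
from collections import Counter

def lifetime_probabilities(trace):
    """Estimate register value lifetime probabilities from a trace.

    trace: iterable of (dest, sources) pairs; dest is a register name or None.
    Returns {lifetime: probability} over values that were later overwritten.
    """
    running = {}    # register -> microoperations scanned since its value was created
    last_ref = {}   # register -> running count at the most recent reference
    counts = Counter()
    for dest, sources in trace:
        for reg in running:              # every live value ages by one microoperation
            running[reg] += 1
        for reg in sources:              # record references to live values
            if reg in running:
                last_ref[reg] = running[reg]
        if dest is not None:
            if dest in running:          # old value dies: last reference count is its lifetime
                counts[last_ref[dest]] += 1
            running[dest] = 0            # initialize counters for the new value
            last_ref[dest] = 0
    total = sum(counts.values())
    if total == 0:
        return {}
    return {life: n / total for life, n in sorted(counts.items())}
```

Under these assumptions a value that is never referenced before being overwritten is recorded with lifetime 0.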
4.1.7 Operation Classes

Given the machine configuration and a program structure, an assignment of microoperations to function units must be made. This mapping is accomplished through a microoperation-function unit table which contains an entry for each microoperation. Each microoperation entry is a list of ordered pairs, (f,t), that specifies a function unit type and an execution time. In this analysis, we assume that a microoperation is executed by only one function unit type. By convention, the function unit executing branch microoperations will always be of type 1.

Let M be the set of all microoperations and let m_i be the i'th microoperation type. It will be both descriptively and computationally convenient to define certain classes of microoperations. The following definitions partition M into time classes and function unit classes or f-classes:

    EX(m_i,f) = the execution time of microoperation m_i on an f-function unit.

    T_M = Max{EX(m_i,f) | m_i \in M}

    M(f,t) = {m_i \in M | EX(m_i,f) = t}

    Mt(t) = \bigcup_{f=1}^{F} M(f,t)          F is the number of f-types.

    Mf(f) = \bigcup_{t=1}^{T_M} M(f,t)        T_M is the max execution time.

The  above  definitions  may  be  used  in  determining  the  loading 
on  function  units.  Branch  and  memory  microoperations  require 
further  classification. 
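
The class definitions above amount to grouping a table by its two keys. A minimal sketch, assuming the microoperation-function unit table is a non-empty dictionary mapping each microoperation name to a single (f, t) pair (the text assumes one function unit type per microoperation); the function name and the sample microoperation names used with it are hypothetical.

```python
from collections import defaultdict

def build_classes(table):
    """Partition microoperations into M(f,t), Mt(t), Mf(f) and find T_M.

    table: non-empty dict, microoperation name -> (function unit type f, execution time t).
    """
    M = defaultdict(set)              # M(f,t): executed on type f in exactly t cycles
    for m, (f, t) in table.items():
        M[(f, t)].add(m)
    Mt = defaultdict(set)             # time classes: union of M(f,t) over f
    Mf = defaultdict(set)             # f-classes: union of M(f,t) over t
    for (f, t), ops in M.items():
        Mt[t] |= ops
        Mf[f] |= ops
    T_M = max(t for (_, t) in M)      # maximum execution time
    return dict(M), dict(Mt), dict(Mf), T_M
```

For example, a table with a branch on unit type 1 and arithmetic microoperations on types 2 and 3 yields one time class per distinct execution time and one f-class per unit type.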

Branch  microoperations  have  the  capability  to  alter  the 
sequence  of  microoperations  by  replacing  the  contents  of  the 
program counter with a branch address. For the basic processors, we consider only two kinds of branch address calculations:

    1. Branch Address <- * + disp

    2. Branch Address <- immediate constant

where

    * is the address of the branch microoperation,
    disp is a constant in the displacement field of the microoperation, and
    immediate constant is a constant occupying the microoperation field following the branch opcode.

A conditional branch must first evaluate a predicate to determine if the branch address is to be used. Predicate evaluation will depend on the execution of a CC microoperation which will set specified condition codes to reflect the execution outcome. The test performed on the condition codes is specified by a field in the conditional branch microoperation.

Neither  conditional  nor  unconditional  branch  microoperations 
need  a  destination  register  other  than  the  program  counter;  thus 
their  issue  is  not  impeded  by  a  work  register  shortage.  As  the 
address  components  are  immediate,  branch  address  calculations  are 
not  delayed  by  operand  waits.  Only  conditional  branches  may  be 
delayed  by  previous  CC  microoperations.  Both  types  of  branches 
act  as  delimiters  of  the  Window  contents  in  the  microoperation 
stream.  However,  if  the  control  unit  has  a  lookahead  facility, 
the  branch  address  can  be  made  available  the  cycle  after  that 
branch is detected. Consequently, an unconditional branch will produce minimal disruption to the microoperation stream.

Memory  microoperations  move  data  between  the  memories  and  the 
work registers. The Load microoperation transfers 1 unit of data from a specified memory cell to a specified register. The Store
microoperation  performs  the  opposite  action.  As  both  types 
require  a  memory  address  which  is  contained  in  a  register,  they 
are  subject  to  operand  delays  due  to  address  computations.  In 
addition  the  load  requires  a  destination  register  to  receive  the 
transferred  result  while  the  store  specifies  an  additional 
operand  register  whose  contents  are  to  be  stored.  Both  of  these 
requirements  may  result  in  issuing  delays. 

To accommodate these differences in memory and branch microoperations, let

    Mtv(t) = Mt(t) - {m | m is a store or branch microoperation}

    Mv = \bigcup_{t} Mtv(t)

Mv is the set of register value creating microoperations. Only microoperations from this set can cause operand waits.

Memory  microoperations  require  special  treatment  in 
environments  supporting  the  automatic,  concurrent  execution  of 
microoperations. In general, it is difficult to determine the memory address within the time scale of the scanning process to detect potential hazards. For worst case hazard prevention, it
is  necessary  to  preserve  the  stream  order  of  memory 
microoperations  in  execution.  If  the  memory  management  system 
does  not  have  a  FIFO  memory  request  queue,  memory  microoperations 
cannot  be  issued  around  others  delayed  in  the  Window.  This  will 
contribute  to  the  Window  blockage. 
4.2 Bowra and Torng Issue Delay Model

Bowra and Torng's processor model [BT76] is briefly described. This model includes many aspects of multiple function unit processors within a relatively simple computational framework. The assumptions used and their control policy are discussed, followed by a derivation of their equations. The derivation is identical, except for changes in notation, to that found in reference [BT76]. It is included as a convenience to the reader. The section closes with a modification to their model to include the effects of operand waits to main memory operations and of time slippages to the issue delay between the issue of successive instructions. In view of the high incidence of memory instructions in typical program structures [ALE72, M?l74, PS77] and of the importance of the computed issue delay to the IER, it would appear useful to include these effects while retaining the computational framework of the model.


Bowra and Torng model single instruction stream, single data stream processors, where all instructions pass sequentially to an Instruction Unit that sequentially issues instructions to function units. They observe that the IER is the inverse of the average time between the instant an instruction fetch begins and the instant it is issued. They consider two components in this time, the fetch delay, DS, and the issue policy delay, DP.

Their provision policy is implicitly stated and its effect is reflected in the expression for DS. They assume that a delay of 1 machine cycle is incurred for an instruction fetch, except for those following a branch. In this case, an average value of E_b cycles is assumed. Thus if the occurrence probability of a branch is p_b, then

    DS = p_b E_b + (1 - p_b)          (4.6)


The issue policy delay, DP, is determined from considerations of their issue policy. The issuing instruction I and the processor state must satisfy the following conditions:

1. There is at least one function unit available for the execution of I.

2. The register that is specified as the destination by I cannot be the destination of any executing instruction.

3. The operands of I cannot be the results of instructions being executed.

Conditions 2 and 3, as stated, prevent issuing of instructions if any preceding instruction is blocked. Otherwise, an early read or write hazard can occur because a blocked instruction specifying a destination d is not executing, and thus cannot prevent a microoperation specifying d as either source or destination from issuing. This effectively implies a maximum scan rate of 1.


The program structure parameters used in the Issue Delay model are:

    p_i - the dynamic occurrence probability of an instruction of type i
    r_i - a time independent probability distribution that an instruction requires the result or result register of the i'th previous instruction
    b_i - a time independent probability distribution that a branch requires the result of the i'th previous instruction

The machine parameters are:

    n_f = the number of f-function units in the machine
    n_m = the number of memory modules in the memory system
    EX(m_i,f) = the execution time of microoperation m_i on an f-function unit
    T_M = MAX{EX(m_i,f)}
    t_m = the cycle time of a main memory module
    p_b = occurrence probability of a branch instruction
    p_m = occurrence probability of a memory instruction
    M = the set of all instructions
    Mft(f,t) = {m_i \in M | EX(m_i,f) = t \land m_i executed by f-function units}
    Mt(t) = \bigcup_{f} Mft(f,t)
    Mf(f) = \bigcup_{t} Mft(f,t)
    Mtv(t) = Mt(t) - {m | m a store inst.} - {m | m a branch}
    Mv is the set of value creating microoperations.


There are three components to DP: DPR, DPC and DPF. These are defined below.

DPR = Mean issue delay to an arithmetic or Boolean instruction at the issue point due to an operand or result register wait
    = P[an arithmetic or Boolean instruction at issue point] x P[k'th previous instruction causes delay] x Mean delay from k'th prior position

The probability that DPR will occur is

    PPR = (1 - p_b - p_m) \sum_{k} r_k \sum_{t} P[Mtv(t)]          (4.8)

DPC = Mean issue delay to a conditional branch at issue point

    DPC = p_b \sum_{k} b_k \sum_{t} (t - k) P[Mtv(t)]          (4.9)

DPF = Mean delay due to waits on busy function units
    = \sum_{f \ne branch unit or memory} [mean delay by an f-function unit] x P[all f-function units busy]

    DPF = \sum_{f \ne branch unit or memory} [\sum_{t=n_f+1}^{T_M} P[Mft(f,t)] (t - n_f)/2] [(IER/n_f) \sum_{t} t P[Mft(f,t)]]          (4.10)
The issue policy delay is formed by summing the above components. A correction factor of (1-PPR) is applied to the mean delay due to busy function units because the simultaneous occurrence of a DPR and DPF is negligible.

    D = DS + DPR + DPC + (1 - PPR) DPF          (4.11)

Since the IER is the inverse of D, manipulating equation (4.11) gives the following polynomial:

    \sum_{f \ne branch unit or memory} A_f (IER)^2 + B (IER) - 1 = 0          (4.12)

where

    B = DS + DPC + DPR          (4.13)

and

    A_f = [\sum_{t=n_f+1}^{T_M} P[Mft(f,t)] (t - n_f)/2] [(1/n_f) \sum_{t=1}^{T_M} t P[Mft(f,t)]] (1 - PPR)          (4.14)

The Issue Delay model has the advantage of having relatively modest computational requirements. Among the simplifications used in its derivation are that the accumulated delays between microoperation issues and the issue delay to memory instructions are negligible. These effects are included in a modified model described below.


The issue delay to memory instructions is included by altering the first term in equations 4.7 and 4.8 to (1-p_b). This assumes that the behaviour of memory instruction operand references is adequately described by r_i. In addition, an A_m term (equation 4.14) must be included for memory.


The delays of DPR, DPC, and DPF were calculated without including the time slippage due to the delay between previously issued instructions. This is corrected by including slippage corrections to the equations defining the above terms, based on an estimate of the IER. The corrections assume that each issue contributes the average delay. An iterative procedure is necessary to obtain a final IER.

The slippage correction changes the limits in the summations and the delay time. Equations (4.7) to (4.10) are given in their modified form below.

    DPRM = (1 - p_b) \sum_{k=1}^{K_M} r_k \sum_{t=t_k}^{T_M} (t - t_k) P[Mtv(t)]          (4.15)

The upper limit of the first summation, K_M, is the relative index of the earliest issued instruction that can affect issue delay. It is obtained from the following considerations.

To contribute to delay, the earliest instruction must have the maximum execution time, T_M. If it is the K_M'th previous instruction, the following relation must hold:

    (K_M - 1)/IER + DS < T_M.

The first term is the expected accumulated time between the issue of the K_M'th and first previous instructions, where 1/IER is the expected time between the issue of successive microoperations. This can be seen with the aid of the diagram below.

    K_M   K_M-1   ...   3   2   1   0

The second term on the left hand side is the average fetch delay. Rearranging the relation, we obtain

    K_M < (T_M - DS) IER + 1

Thus K_M is obtained by taking the floor function:

    K_M = \lfloor (T_M - DS) IER + 1 \rfloor          (4.16)

The quantity t_k in equation 4.15 is the minimum execution time that the k'th previous instruction must have to contribute to operand delay. From the previous diagram, this is seen as

    t_k = DS + (k-1)/IER          (4.17)
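
As a brief numerical illustration of equations (4.16) and (4.17), the sketch below computes the summation limits for a given IER estimate. The function names and the sample values are hypothetical and chosen only to show the arithmetic.

```python
import math

def earliest_index(T_M, DS, ier):
    """K_M of equation (4.16): index of the earliest instruction that can still cause delay."""
    return math.floor((T_M - DS) * ier + 1)

def min_exec_time(k, DS, ier):
    """t_k of equation (4.17): minimum execution time for the k'th previous
    instruction to contribute to operand delay."""
    return DS + (k - 1) / ier
```

With T_M = 6 cycles, DS = 1.2 cycles and an IER estimate of 0.5, K_M is 3, and the third previous instruction contributes only if its execution time is at least t_3 = 5.2 cycles, which is below T_M as the defining relation requires.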

These summation limits are also used in equations 4.18 - 4.20 below.

    PPRM = (1 - p_b) \sum_{k=1}^{K_M} r_k \sum_{t=t_k}^{T_M} P[Mtv(t)]          (4.18)

    DPCM = p_b \sum_{k=1}^{K_M} b_k \sum_{t=t_k}^{T_M} (t - t_k) P[Mtv(t)]          (4.19)

    DPFM = \sum_{f \ne branch} [\sum_{t=t_f}^{T_M} [(t - t_f)/2] P[Mft(f,t)]] [(IER/n_f) \sum_{t} t P[Mft(f,t)]]
           + [p_m (t_m - DS)/2] [t_m IER (1 + p_m)/n_m]          (4.20)

The derivation of t_f is similar to the derivation of t_k. It is the minimum execution time that an f-function unit must have to be able to cause a delay given n_f f-function units.

    t_f = DS + (n_f - 1)/IER          (4.21)

The additional term includes the component for the memory system. Memory is assumed to be composed of n_m identical modules that can be independently accessed with a cycle of t_m. Instruction stream loading is included in the second term. It is assumed to be IER instructions per cycle. Because a memory access is directed to a particular module, the second term reflects the busyness of a selected module.

Equations (4.11), (4.13), (4.14) become

    DM = DS + DPRM + DPCM + (1 - PPRM) DPFM          (4.22)

    B* = DS + DPCM + DPRM          (4.23)

    A*_m = [p_m (t_m - DS)/2] [t_m (1 + p_m)/n_m] (1 - PPRM)          (4.24)

The steps below describe the proposed iterative procedure:

1. Find an initial estimate for IER = IER_0. Set i = 1.

2. Evaluate
     1  DS, DPRM, DPCM, PPRM
     2  B*, A*_f

3. Using B* and the A*_f in equation (4.12), find a root in the interval (0,1), which is IER_i.

4. IF |IER_i - IER_{i-1}| > e     /* e is an error tolerance */
   THEN i <- i+1
        GO TO 2
   ELSE IER_i is the final estimate.
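
The iteration can be sketched in Python as follows. This is only a sketch under stated assumptions: equation (4.12) is taken to be quadratic in IER (as the linear dependence of the function unit delay on IER suggests), and the evaluation of the model terms in step 2 is delegated to a caller-supplied function `coefficients`, a hypothetical name, which returns the sum of the A* terms and B* for the current estimate. The check that the root lies in (0,1) is left to the caller.

```python
import math

def positive_root(a, b):
    """Positive root of a*x**2 + b*x - 1 = 0, assuming a >= 0 and b > 0."""
    if a == 0:
        return 1.0 / b
    return (-b + math.sqrt(b * b + 4.0 * a)) / (2.0 * a)

def iterate_ier(coefficients, ier0, tol=1e-6, max_iter=100):
    """Repeat steps 2 to 4 until successive IER estimates differ by at most tol.

    coefficients(ier) -> (a_sum, b_star), evaluated at the current estimate.
    """
    ier = ier0
    for _ in range(max_iter):
        a_sum, b_star = coefficients(ier)          # step 2
        new_ier = positive_root(a_sum, b_star)     # step 3
        if abs(new_ier - ier) <= tol:              # step 4
            return new_ier
        ier = new_ier
    raise RuntimeError("IER iteration did not converge")
```

A full model would compute `a_sum` and `b_star` from equations (4.15) to (4.24); constant or simple closed-form coefficients are enough to exercise the loop.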

The change in the IER estimate due to these modifications to the Bowra and Torng model will, in general, depend on the model inputs. The time slippage between successive issues will tend to increase the IER, while the operand waits of memory instructions will tend to decrease the IER.


4.3 Processor Models with Look-Ahead and Concurrent Initiations

The Bowra and Torng model described in the previous section includes many important features of computer systems. The following work provides a more general framework that includes the effects of the Converger, Conditioning schemes, microoperation buffering, source buffering, and destination forwarding. The framework also allows the modeling of more general control policies and the capability to simultaneously initiate concurrent microoperations. The latter feature allows a more direct modeling of horizontal microprogrammable control units.

The context of our modeling will be described with reference to figure 4.4, which displays the relative importance or severity of various classes of waits that degrade computer system performance. I/O waits arise when computation is suspended because either program or data are not available in memory for execution. I/O service requires data transfers from secondary storage or other peripherals and causes waits whose magnitudes overwhelm those that arise farther down the hierarchy. Given that there are no I/O waits, machine instruction waits can occur and cut off the flow of microprograms and idle the processor. The character of these waits is analogous to microoperation interpretation waits and will not be directly described here. Given a continuous supply of microprograms (or machine instructions), microoperation waits are the major cause of computation stoppage. These are affected by the provision policy, the supply rate of the Control Memory system, and the incidence of branches along with the Stream Controller's ability to cope with them. After the microoperation supply problem is resolved, operand waits are the next major source of computation stoppage. Operand waits are a property of the program structure but they can be alleviated by providing faster function units and, to some degree, by providing more function units. Operand waits are considered more serious than resource waits for either function units or registers. Resource waits can be overcome by providing more resources.

For  the  basic  processor  class,  we  will  consider  only  the 
microoperation,  operand,  and  resource  waits.  We  assume  that 
waits  at  a  higher  level  do  not  influence  the  computation.  If 
necessary,  their  effects  can  be  included  by  computing  their  mean 
wait  times  and  their  probability  of  occurrence. 
Figure 4.4 Computer System Waits (figure label: Basic Processor Waits)

An overall system performance can then be computed. Decomposability techniques, as described by Courtois [COU75], and I/O, memory, and processor interference models, as described by Shemer and Gupta [SG69] and Smith [SMI74], could be used.

The  type  of  computer  that  can  be  described  by  our  performance 
models  is  the  single  instruction  stream,  single  data  stream 
processor.  The  models  are  decomposed  into  two  parts  -  the  Window 
model  and  the  Scan  Space  model.  The  Window  model  describes  the 
microoperation  prefetch  process.  For  adaptive  processors,  it 
describes the capability of the Stream Controller and Control Memory system to provide Active microoperations for execution. The effects of provision policies, branches, Conditioning schemes, Control Memory system transmission capability, and Converger capacity are included in the Window model. These effects are characterized by a random variable W such that P[W=i] gives the probability that an Active microoperation resides in Window position i.

The  second  part,  the  Scan  Space  model,  describes  the 
capability  of  the  Issuer  to  issue  microoperations  to  the  Control 
Buffer  and  the  operation  unit’s  capability  to  execute  non-branch 
microoperations.  The  Scan  Space  model  characterizes  the 
disposition  and  execution  policies  as  sets  of  probabilistic 
events  whose  probabilities  are  determined  from  the  machine 
configuration  and  program  structure.  The  Scan  Space  model  uses 
the  density  W(i)  and  the  probabilities  associated  with  events  of 
issuing  and  execution  to  determine  a  probability  density  for  I, 
the  number  of  microoperations  issued  in  a  scan  cycle.  The 
expected  value  of  this  distribution  will  then  give  the  mean 
number  of  microoperations  issued  per  cycle  when  the  processor  is 
in  a  steady  state  situation. 

A  simple  model  of  the  computation  process  in  a  basic 
processor  is  illustrated  in  figure  4.5.  There  are  bursts  of 
computation  of  some  random  length  T  punctuated  by  periods  of 
inactivity  of  random  length  B.  The  inactivity  is  caused  by 
interruptions to the microoperation stream by branch waits. IER is the execution rate, including the effect of branch waits. We will also use IER*, the execution rate not including branch waits. From the diagram, IER and IER* are related by

    IER = IER* E[T]/(E[T] + E[B])          (4.25)

where E[T] and E[B] are the expected values of T and B respectively.

Figure 4.6 Computation Burst
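
Equation (4.25) simply scales the burst rate by the fraction of time spent computing. A one-function sketch, with a hypothetical function name and illustrative numbers:

```python
def effective_rate(ier_star, e_t, e_b):
    """IER of equation (4.25): burst rate IER* scaled by the busy fraction E[T]/(E[T]+E[B])."""
    return ier_star * e_t / (e_t + e_b)
```

For instance, a burst rate of 2 microoperations/cycle with E[T] = 8 cycles and E[B] = 2 cycles gives an overall rate of 1.6 microoperations/cycle; with no branch waits (E[B] = 0) the two rates coincide.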

Figure  4.6  provides  greater  detail  of  a  computation  burst. 
The  activity  profile  illustrates  the  execution  activity  of  some 
statistically averaged block requiring E[T] cycles to execute,
preceded  by  an  idle  period  of  length  E[B].  A  start  up  transient 
occurs  because  resource  utilization  and  operand  wait  are 
decreased  by  the  preceding  idleness.  Consequently  the  expected 
initial  rate  of  issue  will  be  higher.  Following  the  transient  a 
steady  state  may  be  reached  and  the  average  level  of  computation 
becomes  constant.  The  profile  tails  off  as  the  remaining  number 
of  block  microoperations  diminishes  to  zero. 

The Window model quantifies the effects of the branch waits, E[B], and the tailing off of the computation activity. The scan space model uses the results of the Window model to establish the existence of an active microoperation that is a candidate for issue. The scan space model determines the IER* level of activity.
4.4 The Window Models

The Window models characterize the fetching of microoperations. The Window is that part of the Converger that is accessible to the Issuer. For the adaptive microprogrammable control units described in chapter 3, this consists of the Active arm and the conditional T and F arms. The fetching of microoperations in an adaptive processor will be described by the Converger Window model derived in the next section. On conventional microprogrammable control units, there is no physical Window. Conceptually, however, there is a logical Window that is used by the microprogrammer as a microprogram is generated. It consists of the set of microoperations belonging to the block that is currently being implemented in microcode. The properties of this logical Window will be captured by a separate microprogram Window model in section 4.4.2.

The Window model computes three primary components. The first two are E[B] and E[T], the expected lengths of the branch wait and computation bursts, respectively. They are used in equation (4.25) to compute IER. The third output is a probability distribution W(i) that gives the probability that the Issuer will find an active microoperation in Window position i. W(i) is essential to permit a direct modeling of concurrent initiations. It is used in the scan space model.

4.4.1  The  Converger  Window  Model 

To  provide  a  background  for  this  Window  model,  a  brief  review 
of  Converger  dynamics  is  given.  Assume  that  the  Window  is 
initially empty and that microoperations are entering from the control memory system at a rate of S microoperations/cycle. As they enter, the Issuer begins to examine microoperations and issues them at a rate of IER microoperations/cycle. Consequently, microoperation build-up in the Window occurs at rate S-IER. Build-up continues until the Active arm is filled, the overflow microoperations entering the F arm. This is a natural buffer for the Active microoperations because the fall-through block would enter into the F arm. If an unconditional branch is detected, microoperation inflow to the Converger would temporarily cease for one cycle while the branch address is computed. During this time, the Window contents would be emptying at rate IER. Recall that when an unconditional branch is detected in the Stream Controller (section 3.3.2), fetching of microoperations ceases while the branch address is determined. The effect of unconditional branches will be included by using a reduced supply rate, S* = S(1-p_u), where p_u is the occurrence probability of an unconditional branch.

If  a  conditional  branch  is  detected,  a  FORK  operation  would 
set  up  two  conditional  arms,  F  and  T.  The  Control  Memory  supply 
rate  would  then  be  equally  shared  by  the  newly  created 
conditional  arms  at  rate  S*/2.  Also,  the  supply  to  the  Active 
arm  would  terminate  and  the  active  Window  level  would  be 
decreasing at rate IER. Each detected conditional branch at this
point  would  again  set  up  two  new  conditional  arms  at  a  higher 
Converger  level,  halving  the  previous  supply  rate  to  the  newly 
created  program  arms.  Eventually,  the  CC  microoperation  in  the 
active  arm  would  be  executed  and  the  predicate  evaluated. 
Depending on the CC microoperation advance relative to its branch and its execution time, there may be residual microoperations from the block tail that are executed along with the newly activated microoperations. On the other hand, several cycles may pass before an arm is activated, waiting for predicate resolution. In either case, immediately after predicate evaluation, a set of activated microoperations is presented to the Issuer. An expression for the expected size of this set will be derived below.
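
The supply-rate bookkeeping described above can be summarized in a few lines. The sketch assumes that the reduced supply rate is S(1 - p_u), with p_u the occurrence probability of an unconditional branch, and that each pending FORK halves the rate delivered to the arms it creates; the function name and the level numbering (0 for the Active arm before any FORK) are illustrative choices, not taken from the original.

```python
def arm_supply_rate(S, p_u, level):
    """Supply rate to one arm at a given Converger level.

    S: Control Memory supply rate in microoperations/cycle.
    p_u: occurrence probability of an unconditional branch.
    level: 0 for the Active arm, k for an arm created by the k'th pending FORK.
    """
    reduced = S * (1.0 - p_u)        # supply rate reduced for unconditional branches
    return reduced / (2 ** level)    # each FORK halves the rate to its new arms
```

With S = 4 and p_u = 0.25, the Active arm is fed at 3 microoperations/cycle, each first-level conditional arm at 1.5, and each second-level arm at 0.75.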


After the CC microoperation is initiated, issuing may cease because no Active microoperations remain. This may occur for a succession of B cycles, where B is the random variable denoting the number of cycles execution is suspended for branch waits. During this time, W=0, where W is the random variable for the number of active microoperations in the Window. Thus P[W=0] will be proportional to the expected value of B, E[B].


The Window model components are found as follows:

1. E[B] and E[T] are derived using elementary probability theory on probability distributions of the program structure.

2. P[W=i] is derived as a geometric distribution with mean (E[c_r] and E[W_0]). E[c_r], the expected number of residual microoperations, is found using elementary probability theory on probability distributions of the program structure. E[W_0] is then derived using elementary queueing theory. This derivation is quite detailed.

To compute E[B], the following assumptions are made:

1. The occurrence of a branch microoperation in the stream is statistically independent of other microoperation occurrences and time independent.

2. When the CC microoperation is initiated, c microoperations remain in the block to be executed, with probability b_c.

3. The predicate is evaluated when the CC microoperation completes execution.

4. All previously initiated microinstructions contain IER* microoperations. This allows us to fix the location of a CC microoperation in a previously initiated microinstruction. IER* is used because it is known that no branch waits occur in this interval.

The following variables and notation will be used:

$M_c$    - a microinstruction containing a CC microoperation.

Mtc(x)   - the set of microoperations that set condition codes and
           whose execution time is x cycles.

$T_m$    - the maximum execution time of a condition code setting
           microoperation.

$p_c$    - the probability of occurrence of a conditional branch or,
           equivalently, a CC microoperation.

c        - random variable specifying the input stream displacement
           in microoperations between the CC microoperation and its
           conditional branch.

$b_i$    - $b_i = P[c=i]$

IER      - number of microoperations executed per cycle.

PrW      - random variable specifying predicate resolution time
           from the instant of CC microoperation initiation.

B        - branch wait time for an adaptive microprogrammable
           control unit.

N        - number of microinstructions in the block tail, i.e. from
           the CC microinstruction, $M_c$, to the block end.


E[B] can be found by observing that a branch wait occurs only
if the predicate wait exceeds the number of cycles required to
execute the remainder of the block after the CC microinstruction,
$M_c$, is initiated, i.e. PrW > N.


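$$
\begin{aligned}
E[B] &= \sum_{t \ge 1} t \, P[\mathrm{PrW} - N = t] \\
     &= \sum_{t \ge 1} t \sum_{x \ge t+1} P[\mathrm{PrW} = x \wedge N = x-t] \\
     &= \sum_{t \ge 1} t \sum_{x \ge t+1} P[\mathrm{Mtc}(x)] \, P[N = x-t]
\end{aligned}
$$

where

$$ P[N = x-t] = \sum_{y=Y_0}^{Y_1} b_y $$

and

$$ Y_0 = \lfloor (x-t-1)\,\mathrm{IER}' + 1 \rfloor, \qquad
   Y_1 = \lfloor (x-t)\,\mathrm{IER}' \rfloor \tag{4.26} $$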

P[N=x-t] gives the probability that $M_c$ is initiated x-t
microinstructions ahead of the last microoperation in the block.
The limits of that summation assume that each microinstruction
contains IER' microoperations in the interval under
consideration.
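The double sum of equation 4.26 can be evaluated directly once the two input distributions are tabulated. The following is a minimal sketch, not the author's program: the dictionary arguments `p_mtc` (x -> P[Mtc(x)]) and `b` (y -> b_y), the function names, and the reading of the inner summand as b_y are assumptions made for illustration.

```python
from math import floor

def tail_prob(n, b, ier_prime):
    """P[N = n]: sum of b_y over the microoperations that fall in the
    n-th microinstruction before the block end (limits Y0, Y1 of 4.26)."""
    y0 = floor((n - 1) * ier_prime + 1)
    y1 = floor(n * ier_prime)
    return sum(b.get(y, 0.0) for y in range(y0, y1 + 1))

def expected_branch_wait(p_mtc, b, ier_prime):
    """E[B] per equation 4.26.

    p_mtc: dict mapping execution time x to P[Mtc(x)] (assumed input form)
    b:     dict mapping displacement y to b_y = P[c = y] (assumed input form)
    """
    t_max = max(p_mtc)  # longest CC execution time with nonzero probability
    total = 0.0
    for t in range(1, t_max):
        for x in range(t + 1, t_max + 1):
            total += t * p_mtc.get(x, 0.0) * tail_prob(x - t, b, ier_prime)
    return total

# Hypothetical inputs: CC takes 1 or 3 cycles, tail holds 1 or 2 microoperations.
example_wait = expected_branch_wait({1: 0.5, 3: 0.5}, {1: 0.5, 2: 0.5}, 1.0)
```

With these made-up inputs a wait arises only when the CC microoperation takes 3 cycles, which is why the short execution time contributes nothing to the sum.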


The expected computation length, E[T], is the expected number
of microinstructions initiated between the branch wait intervals.
In this interval, it is known that there are no branch waits.
Consequently, the instruction execution rate is IER'. As the
mean number of microoperations per block is $1/p_c$,

$$
\begin{aligned}
E[T] &= E[\text{time to execute a block of microoperations} \mid
        \text{no branch wait}] \\
     &= 1/(\mathrm{IER}'\, p_c)
\end{aligned}
\tag{4.27}
$$

P[W=i], $1 \le i \le W_m$, will be determined from the mean number
of microoperations that are active immediately after predicate
resolution. P[W=i], i=1,2,... will be set equal to the geometric
distribution with that mean. It will be seen later that the
limiting case of this distribution is the distribution for the
conceptual Window of the microprogrammable control unit.

The mean number of microoperations following predicate
evaluation consists of two components, $E[c_r]$ and $E[W_0]$.
$E[c_r]$ is the expected number of residual microoperations that
remain from the Active block following predicate resolution.
$E[W_0]$ is the expected number of microoperations that are
prefetched for the activated block by the Converger.


The derivation of $E[c_r]$ uses reasoning similar to that used
to derive E[B] in equation 4.26. A residual microoperation
arises only if PrW < N, where N is the number of
microinstructions remaining in the block when $M_c$ is initiated.
During this interval, as there are no branch waits, the execution
rate is IER'. Thus,


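$$
\begin{aligned}
E[c_r] &= \mathrm{IER}' \sum_{i=1}^{\infty} i \, P[N - \mathrm{PrW} = i] \\
       &= \mathrm{IER}' \sum_{i=1}^{\infty} i \sum_{t=1}^{T_m}
          P[\mathrm{PrW} = t \wedge N = i+t] \\
       &= \mathrm{IER}' \sum_{i=1}^{\infty} i \sum_{t=1}^{T_m}
          P[\mathrm{Mtc}(t)] \, P[N = i+t]
\end{aligned}
$$

where

$$ P[N = i+t] = \sum_{j=J_0}^{J_1} b_j, \qquad
   J_0 = \lfloor (i+t-1)\,\mathrm{IER}' + 1 \rfloor, \qquad
   J_1 = \lfloor (i+t)\,\mathrm{IER}' \rfloor \tag{4.28} $$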
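Equation 4.28 has the same shape as equation 4.26 with the roles of PrW and N exchanged, so it can be evaluated the same way. The sketch below is illustrative only; the dictionary inputs `p_mtc` and `b` and the function names are assumptions, and the infinite outer sum is cut off where P[N = n] is known to vanish for a finite table of b_j.

```python
from math import ceil, floor

def tail_prob(n, b, ier_prime):
    """P[N = n]: sum of b_j for j between the limits J0 and J1 of 4.28."""
    j0 = floor((n - 1) * ier_prime + 1)
    j1 = floor(n * ier_prime)
    return sum(b.get(j, 0.0) for j in range(j0, j1 + 1))

def expected_residual(p_mtc, b, ier_prime):
    """E[c_r] per equation 4.28.

    p_mtc: dict t -> P[Mtc(t)] (assumed input form)
    b:     dict j -> b_j = P[c = j] (assumed input form)
    """
    n_max = ceil(max(b) / ier_prime)  # P[N = n] is zero for n beyond this
    total = 0.0
    for i in range(1, n_max + 1):
        for t, p_t in p_mtc.items():
            total += i * p_t * tail_prob(i + t, b, ier_prime)
    return ier_prime * total

# Hypothetical inputs: every CC microoperation takes one cycle.
example_residual = expected_residual({1: 1.0}, {1: 0.25, 2: 0.25, 3: 0.5}, 1.0)
```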


The derivation of $E[W_0]$ consists of two steps. In the first
step, we derive the probability that a new conditional block,
arising from a FORK operation, has k predecessor blocks in the
Converger. A queueing model will give an indication of the
number of Converger levels that can be utilized for a given IER
and Control Memory microoperation transmission rate.


The second step determines the expected number of
microoperations prefetched for a block from the time of its
inception into the Converger to the time of its activation. The
method used models the path through the Converger as a multistage
server, each stage having a mean service rate proportional to
$s/2^k$, where k is the Converger level number. At each stage, the
service times are geometrically distributed with mean
$1/(p_c\,\mathrm{IER})$, the mean time to execute a block of
microoperations, where $1/p_c$ is the expected number of
microoperations in a block.

As the dynamics of the conditional paths in the Converger are
equivalent, only one conditional path will be considered. A
queue arrival is said to occur when the conditional branch for a
block arrives and a FORK is invoked. Thus the number of
customers in the queue corresponds to the number of unresolved
conditional branches in a conditional path. Microoperations are
fetched to an arm at level k at a rate

$$ S'(k) = S(1-p_u)/2^k \tag{4.29} $$

Because the probability of occurrence of conditional branches is
$p_c$, the mean customer arrival rate with k customers in the
queue is

$$ \lambda_k = p_c\, S'(k), \qquad k \ge 1 \tag{4.30} $$

The nature of the Converger is such that there is always an
active block. Thus $k \ge 1$.

A customer's service is said to be completed when the
predicate of the Active branch is evaluated. Microoperations are
passed from the Active arm to the Issuer at a mean rate IER.
Thus blocks or customers are serviced at an average rate

$$ \mu = \mathrm{IER}\; p_c \tag{4.31} $$

It will prove convenient to define a primary supply ratio s:

$$ s = S'(0)/\mathrm{IER} = S(1-p_u)/\mathrm{IER} \tag{4.32} $$

Because of the similarity between the geometric and
exponential probability densities, we characterize the above
model by an M/M/1 queue with geometrically discouraged arrivals
[KLE75]. In addition, we assume sufficient Converger storage
capacity to prevent storage overruns.
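The rates of equations 4.29 to 4.32 can be written down as small helper functions, which makes the "geometrically discouraged" character of the arrivals easy to see: the ratio of arrival rate to service rate at level k is s/2^k. This is an illustrative sketch; the function names are hypothetical, and the factor p_c in the arrival rate follows the reading of equation 4.30 used here.

```python
def supply_rate(S, p_u, k):
    """S'(k): Control Memory supply rate to a Converger arm at level k (4.29)."""
    return S * (1.0 - p_u) / 2 ** k

def arrival_rate(S, p_u, p_c, k):
    """lambda_k: mean customer arrival rate with k customers queued (4.30)."""
    return p_c * supply_rate(S, p_u, k)

def service_rate(ier, p_c):
    """mu: mean rate at which blocks are serviced (4.31)."""
    return ier * p_c

def primary_supply_ratio(S, p_u, ier):
    """s = S'(0) / IER (4.32)."""
    return supply_rate(S, p_u, 0) / ier

# Hypothetical parameters: S = 16 microoperations/cycle, p_u = 0.5, IER = 2.
s = primary_supply_ratio(16.0, 0.5, 2.0)
load_at_level_2 = arrival_rate(16.0, 0.5, 0.1, 2) / service_rate(2.0, 0.1)
```

For these made-up values s = 4 and the load at level 2 equals s/2^2 = 1, so each additional level halves the offered load.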

The state transition diagram for the queueing system is shown
in figure 4.7.

Figure 4.7 Converger State Transitions

Letting $p_k$ = P[k customers in queue at equilibrium], then
from figure 4.7, the following probability equations may be
written [KLE75]:

$$ p_k = p_1 \prod_{i=1}^{k-1} \frac{\lambda_i}{\mu}
       = p_1 \left[ \frac{S(1-p_u)}{\mathrm{IER}} \right]^{k-1}
         \prod_{i=1}^{k-1} 2^{-i}, \qquad k > 1 \tag{4.33} $$

Thus for the Converger supporting l levels, equation (4.33) can
be solved subject to

$$ \sum_{i=1}^{l-1} p_i = 1 \tag{4.34} $$

giving

$$ p_k = \frac{s^{k-1}\, 2^{-k(k-1)/2}}
              {\sum_{i=1}^{l-1} s^{i-1}\, 2^{-i(i-1)/2}} \tag{4.35} $$


Equation (4.35) can be used to determine the expected number of
occupied levels in the Converger. It will be used to determine
the probability that a customer arrives with k customers in the
queue. Thus the probability that a block prefetch begins at
level number k in the Converger is $p_k$. (Recall that the block
currently being serviced is at level number 0.)


Table 4.1 of unnormalized $p_k$ (equation 4.33) indicates the
relative utilization of various Converger levels. The table
shows the exponential increase in supply rate needed to maintain
an additional Converger level. A doubling of the primary supply
ratio s shifts the maximum $p_k$ one level higher. For a primary
supply ratio of 4, 95% of the blocks would enter level number 4
or lower. In this case a 5 level Converger requiring $2^5-1$
Converger arms would be sufficient to accommodate the prefetch
traffic. This is a relatively modest requirement.


      |      s  (primary supply ratio)
   k  |     1       2       4       8
  ----+--------------------------------
   1  |     1       1       1       1
   2  |    .5       1       2       4
   3  |    .13     .5       2       8
   4  |    .063    .13      1       8
   5  |    .00     .02     .25      4
   6  |    .00     .00     .063     1

Table 4.1 Converger Level Entry Incidence


Table  4.1  can  also  be  used  to  estimate  the  probability  that  a 
block  is  completely  prefetched  before  it  is  activated.  This 
estimate  can  be  obtained  by  computing  the  probability  that  a 
block  enters  level  number  2  or  higher.  Entry  of  a  new  block  at 
this  level  or  higher  guarantees  that  the  preceding  block  has  been 
completely  prefetched.  Thus  for  a  5-level  Converger  with  a 
primary  supply  ratio  of  4,  more  than  83%  of  the  blocks  will  be 
completely  prefetched. 
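The level-entry weights and the prefetch estimate above can be reproduced from the closed form of equations 4.33 and 4.35. The sketch below is an illustration under that reading of the equations; the function names are hypothetical, and entries computed this way need not agree with every printed entry of Table 4.1 to the last digit.

```python
def unnormalized_levels(s, levels):
    """Weights p_k / p_1 for k = 1 .. levels-1 (equation 4.33):
    s**(k-1) * 2**(-k(k-1)/2)."""
    return [s ** (k - 1) / 2 ** (k * (k - 1) // 2) for k in range(1, levels)]

def level_probabilities(s, levels):
    """Normalized p_k for a Converger supporting `levels` levels (4.35)."""
    weights = unnormalized_levels(s, levels)
    total = sum(weights)
    return [w / total for w in weights]

# 5-level Converger with primary supply ratio s = 4.
weights = unnormalized_levels(4, 5)
probs = level_probabilities(4, 5)

# A block entering at level 2 or higher finds its predecessor fully prefetched.
fully_prefetched = sum(probs[1:])
```

Here `weights` comes out as 1, 2, 2, 1 for levels 1 to 4, and `fully_prefetched` is 5/6, in line with the "more than 83%" figure quoted in the text.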

We may now proceed to determine the mean number of
microoperations that are prefetched in the Converger before the
block is activated. The prefetch process is modeled by a
multistage server as shown in figure 4.8.

Figure 4.8 Converger Prefetch Server Model (l levels)

For an externally arriving customer there is no queueing and
the customer (a block to be fetched) enters stage k with
probability $p_k$. The customer exits the server if all of its
microoperations are prefetched in stage k, with probability
$1-e_k$, where $e_k$ is the probability a customer does not
complete service at stage k. Otherwise, he passes to stage k-1
for more service with probability $e_k$. Because the block length
L has a geometric distribution, the remaining block length to be
fetched by stage k-1 has the identical distribution. This follows
from the memoryless property of geometric densities.

A customer not completing service at stage k has received
service to the extent S'(k). S'(k) is the Control Memory
supply rate to Converger level k (equation 4.29) and is equal to
the service rate of stage k. Let T be the remaining execution
time of the Active block when the customer entered stage k.
Because the execution rate is assumed to be constant at IER
microoperations/cycle, T is proportional to block length L,
which is geometrically distributed. Thus T is geometrically and
identically distributed for all customers entering stage k. A
customer passes through successive stages until his service is
completed. A customer not completing after exiting stage 1
finishes in stage 0. This corresponds to a block that is not
completely prefetched before it is activated.

Summarizing, each stage k, $1 \le k \le l-1$, has two mutually
exclusive sources of customers having identical distributions
describing their service requirements. A customer may enter
stage k either externally with probability $p_k$ (equation 4.35)
or from stage k+1 with probability $e_{k+1}$, the probability that
a customer's service is not completed in stage k+1 given that the
customer visits stage k+1. Customers may enter the last stage,
level l-1, only externally. The service time of a customer at
each stage, given that the customer's service is not completed,
is identically and geometrically distributed with mean
$1/(p_c\,\mathrm{IER})$.

Let L be the total amount of service that the active block
requires when prefetch begins at level k and let $L_k$ be the
amount of service (number of microoperations prefetched) received
at level number k. $L_k$ is conditioned by the customer visiting
stage k. Let $v_k$ be the probability that a customer visits
stage k. Thus the amount of service received by a customer in
stages 1 to l-1 is the number of microoperations, $W_0$,
prefetched before the block is activated

$$ W_0 = \sum_{k=1}^{l-1} L_k \tag{4.36} $$

with mean

$$ E[W_0] = \sum_{k=1}^{l-1} E[L_k]\, v_k \tag{4.37} $$

and variance [THO69],

$$ \mathrm{Var}[W_0] = \sum_{k=1}^{l-1}
   \left[ \mathrm{Var}[L_k] + (E[W_0] - E[L_k])^2 \right] v_k \tag{4.38} $$

$L_k$ has two components corresponding to the cases when service
to the customer is completed and when it is not. Prefetch (or
service) is completed at level k when the necessary prefetch time
at level k is less than or equal to the remaining execution time
of the currently active block when prefetch begins, i.e.,

$$ L/\mathrm{IER} \ge L_k/S'(k) $$

or

$$ L \ge L_k\,\mathrm{IER}/S'(k) \tag{4.39} $$

Otherwise prefetch is not completed and must continue at level
k-1 at rate S'(k-1). We will assume that L and $L_k$ are
independent and have the identical exponential distribution with
parameter

$$ \lambda = 1/p_c \tag{4.40} $$

The probability that service is incomplete at level k, $e_k$, can
be found by integrating the joint probability density,
$P[x < L_k < x+dx,\; y < L < y+dy]$, over the region of interest,

$$
\begin{aligned}
e_k &= \int_{0}^{\infty} \lambda e^{-\lambda x}
       \int_{y = x\,\mathrm{IER}/S'(k)}^{\infty} \lambda e^{-\lambda y}\, dy\, dx \\
    &= \int_{0}^{\infty} \lambda e^{-\lambda x}\, e^{-\lambda x 2^k/s}\, dx \\
    &= 1/(1 + 2^k/s)
\end{aligned}
\tag{4.41}
$$

where s is the primary supply ratio (equation 4.32). From
equation 4.41,

$$ P[\text{service complete at level } k] = 1 - e_k
   = (2^k/s)/(1 + 2^k/s) \tag{4.42} $$

The probability density for $L_k$, $P[x < L_k < x+dx]$, can be
obtained by considering the two components of $L_k$:

$$
\begin{aligned}
P[x < L_k < x+dx] ={}& P[x < L_k < x+dx \wedge L > x 2^k/s] \\
 &+ P[L_k > x \wedge x 2^k/s < L < x 2^k/s + d(x 2^k/s)]
\end{aligned}
\tag{4.43}
$$

The first component corresponds to the case 'service complete'
and is given by

$$
\begin{aligned}
P[x < L_k < x+dx \wedge L > x 2^k/s]
 &= \lambda e^{-\lambda x}\, dx \int_{y = x 2^k/s}^{\infty} \lambda e^{-\lambda y}\, dy \\
 &= \lambda e^{-\lambda x (1 + 2^k/s)}\, dx
\end{aligned}
\tag{4.45}
$$

The second component corresponds to the case 'service incomplete'
and is given by

$$
\begin{aligned}
P[L_k > x \wedge x 2^k/s < L < x 2^k/s + d(x 2^k/s)]
 &= \lambda e^{-\lambda x 2^k/s}\, d(x 2^k/s) \int_{y = x}^{\infty} \lambda e^{-\lambda y}\, dy \\
 &= \lambda (2^k/s)\, e^{-\lambda x (1 + 2^k/s)}\, dx
\end{aligned}
\tag{4.46}
$$

Combining equations 4.45 and 4.46,

$$ P[x < L_k < x+dx] = \lambda (1 + 2^k/s)\, e^{-\lambda x (1 + 2^k/s)}\, dx \tag{4.47} $$


This is an exponential density. Thus the expected number of
microoperations prefetched at level k is

$$ E[L_k] = 1/[\lambda (1 + 2^k/s)] = p_c/(1 + 2^k/s) \tag{4.48} $$
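The exponential form of the density for L_k can be checked by simulation: the amount prefetched at level k is the smaller of what the block still needs and what the supply at that level can deliver before the active block ends. The sketch below is illustrative, with hypothetical function names and made-up parameter values; it works in terms of the rate parameter lambda directly and assumes, as the text does, that both lengths are independent and exponentially distributed.

```python
import random

def mean_prefetched(lam, s, k):
    """E[L_k] = 1 / (lam * (1 + 2**k / s)), first form of equation 4.48."""
    return 1.0 / (lam * (1.0 + 2 ** k / s))

def simulate_prefetched(lam, s, k, trials=200_000, seed=1):
    """Monte Carlo estimate of E[L_k] under the assumptions of the text."""
    rng = random.Random(seed)
    ratio = 2 ** k / s  # IER / S'(k), from equations 4.29 and 4.32
    total = 0.0
    for _ in range(trials):
        needed = rng.expovariate(lam)   # block length still to be prefetched
        active = rng.expovariate(lam)   # remaining length of the active block
        # Prefetch stops when the block is complete or the active block ends.
        total += min(needed, active / ratio)
    return total / trials

# Hypothetical values: lam = 0.2, s = 4, level k = 2.
analytic = mean_prefetched(0.2, 4.0, 2)
estimate = simulate_prefetched(0.2, 4.0, 2)
```

The minimum of two independent exponentials is again exponential with the summed rate, which is the content of equation 4.47.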

The probability that stage k is visited is obtained by examining
its entry point on figure 4.8:

$$
\begin{aligned}
v_{l-1} &= p_{l-1} \\
v_k     &= p_k + e_{k+1}\, v_{k+1} \\
v_1     &= p_1 + e_2\, v_2
\end{aligned}
\tag{4.52}
$$

The $p_k$ are obtainable from equation 4.35, and the $v_k$ from
the recurrence of equation 4.52. $E[W_0]$ can now be determined
by using equations 4.37, 4.48 and 4.52.

$$ E[W_0] = \sum_{k=1}^{l-1} p_c\, v_k / (2^k/s + 1) \tag{4.53} $$
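The recurrence for the visit probabilities runs from the highest level downward, after which the mean prefetch of equation 4.37 is a weighted sum. A minimal sketch follows. It deliberately takes the p_k, the e_k and the per-level means E[L_k] as caller-supplied tables rather than fixing their formulas, and the function and argument names are hypothetical.

```python
def visit_probabilities(p, e):
    """v_k per the recurrence of equation 4.52.

    p: dict k -> p_k, external entry probability at level k (k = 1 .. l-1)
    e: dict k -> e_k, probability that service is not completed at level k
    """
    top = max(p)                 # level l-1 is entered only externally
    v = {top: p[top]}
    for k in range(top - 1, 0, -1):
        v[k] = p[k] + e[k + 1] * v[k + 1]
    return v

def expected_prefetch(p, e, mean_at_level):
    """E[W_0] = sum over k of E[L_k] * v_k (equation 4.37)."""
    v = visit_probabilities(p, e)
    return sum(mean_at_level[k] * v[k] for k in v)

# Made-up two-stage example, for illustration only.
p = {1: 0.5, 2: 0.5}
e = {1: 0.5, 2: 0.5}
v = visit_probabilities(p, e)
e_w0 = expected_prefetch(p, e, {1: 2.0, 2: 1.0})
```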


The mean number of microoperations upon activating a new
block can be determined from equations 4.28 and 4.53. It will be
assumed that P[W=i] is a geometrical distribution with mean
$E[W_0] + E[c_r]$, i.e.

$$ P[W=i] = p(1-p)^{i-1}, \qquad i = 1,2,\ldots \tag{4.55} $$

where

$$ p = (E[W_0] + E[c_r])^{-1} \tag{4.56} $$

Equation 4.55 will be used only over the physically available
Window positions, $1 \le i \le W_m$, where $W_m$ is the maximum
number of Window positions.
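Equations 4.55 and 4.56 translate into a few lines of code. The sketch below is illustrative; the function name is hypothetical, and it simply evaluates the geometric terms over the available positions without renormalizing, since the text does not say how the truncated tail is to be treated.

```python
def window_distribution(e_w0, e_cr, w_max):
    """P[W = i] for i = 1 .. w_max from equations 4.55 and 4.56.

    e_w0:  E[W_0], mean number of prefetched microoperations
    e_cr:  E[c_r], mean number of residual microoperations
    w_max: W_m, number of physical Window positions
    Assumes e_w0 + e_cr >= 1 so that p is a valid probability.
    """
    p = 1.0 / (e_w0 + e_cr)
    return [p * (1.0 - p) ** (i - 1) for i in range(1, w_max + 1)]

# Hypothetical means: E[W_0] = 3, E[c_r] = 1, eight Window positions.
pmf = window_distribution(3.0, 1.0, 8)
```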

The assumption leading to equation 4.55 has greatly
simplified its derivation. A more precise alternative would be
to derive the probabilities for P[W=i] from the Laplace
transform of the Converger model of figure 4.8, and then to
develop the equations describing the dynamics of the last
microoperation in the Window. This path of analysis would be
very cumbersome.

Equation 4.55 summarizes the effects of a provision policy
utilizing l Converger levels, a maximum Window level $W_m$, a
primary supply ratio of s, unconditional and conditional branch
occurrence probabilities $p_u$ and $p_c$, with a CC microoperation
advance distribution $b_i$, and execution time probabilities
P[Mtc(t)]. Some specialized provision policies are examined
below.

For the case where no prefetch is used after a conditional
branch is detected, $W_0 = 0$. In this case microoperations do not
arrive until the second cycle after predicate resolution. This
case can be treated by increasing E[B] of equation (4.26) by 1 to
include the extra null cycle. Following this null cycle, S'
microoperations would be fetched. This would correspond to an

hC  1  ^  • 

The provision policy in which only the FALSE conditional arm
is prefetched could be treated as a superposition of the
No-prefetch policy described in the previous paragraph and of the
2-level Converger case. If $p_F$ and $p_T$ are respectively the
probabilities that the FALSE and TRUE predicates are evaluated,
then

$$ E[W_0] = E[W_0(S') \mid \text{no prefetch}]\, p_T
          + E[W_0(2S') \mid \text{2 level Converger}]\, p_F \tag{4.57} $$

The second term of equation 4.57 uses a 2 level Converger with
supply rate 2S' because all the Control Memory transmission is
directed to the F Converger arm.

4.4.2  The  Microprogram  Window  Model 

On a conventional microprogrammable control unit, the
microprogrammer, on composing microinstructions, can select
microoperations from only one block to ensure hazard avoidance.
The blocks in this case are terminated by either conditional or
unconditional branches. Because the microprogrammer is not under
any time deadlines to compose the microinstruction, he can, in
principle, examine blocks of infinite length. After
microinstructions have been composed for one block, the
microprogrammer can immediately begin composing microinstructions
for one of the successor blocks. From this description, the
conceptual Window operates with the following parameters:

1. $W_m = \infty$

2. $S = \infty$, thus $W_0 = L$, the block length

3. $E[L] = 1/p_b$ where $p_b = p_u + p_c$

4. $E[c_r] = 0$

As  with  the  adaptive  processor,  we  place  the  conceptual 
Window  in  a  run-time  environment  and  examine  the  dynamics.  To 
determine  P[ W=i  ]  we  derive  the  expected  time  Window  position  i 
will  contain  a  microoperation.  The  development  parallels  that 
used  for  the  Converger  model. 

In conventional microprogrammable control units, the CC
microoperation usually resides in the same microinstruction as
the branch or in the preceding microinstruction. In these cases,
there is little or no overlap of predicate wait with execution.
In addition, conventional microprogrammable control units have
another source of branch wait in those designs where Control
Memory fetch is overlapped with execution. When a conditional or
unconditional branch out of sequence is taken, the previous fetch
is nullified and the contents of the target address must be
fetched. This adds a cycle to branch wait. Letting B denote
branch wait for a conventional microprogrammable control unit
with overlapped Control Memory access, and assuming that the CC
microoperation is initiated one microinstruction before the
branch, then


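$$ B = \begin{cases}
   (\mathrm{PrW} - 1) + 0.5 & \text{for conditional branches} \\
   1                        & \text{for unconditional branches}
   \end{cases} \tag{4.58} $$

where it has been arbitrarily assumed that a conditional
branch out of sequence occurs with a probability of 0.5.
Thus

$$
\begin{aligned}
E[B] &= E[B \mid \text{conditional branch}]\, p_c
      + E[B \mid \text{unconditional branch}]\, p_u \\
     &= (E[\mathrm{PrW}] - 0.5)\, p_c + p_u
\end{aligned}
\tag{4.59}
$$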


The expected time that there is no branch wait, E[T], can be
determined by noting that one microinstruction is initiated every
cycle. Thus E[T] can be computed using the reasoning that led
to equation 4.27. In this case, as blocks are terminated by both
conditional and unconditional branches, the mean block length is
$1/p_b$ where $p_b = p_u + p_c$. IER' is used because we consider a
sequence of microinstructions containing no branch waits. Thus

$$ E[T] = 1/(\mathrm{IER}'\, p_b) \tag{4.60} $$


The expected predicate wait time is

$$ E[\mathrm{PrW}] = \sum_{t=1}^{T_m} t \, P[\mathrm{Mtc}(t)] \tag{4.61} $$

where Mtc(t) is the set of CC microoperations whose execution
time is t.
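Equations 4.59 to 4.61 combine into a short calculation for the conventional control unit. The sketch below is illustrative; the function names and the dictionary form of the execution-time distribution are assumptions, and the numbers in the example are made up.

```python
def expected_predicate_wait(p_mtc):
    """E[PrW] = sum over t of t * P[Mtc(t)] (equation 4.61).
    p_mtc: dict t -> P[Mtc(t)] (assumed input form)."""
    return sum(t * prob for t, prob in p_mtc.items())

def conventional_branch_wait(p_mtc, p_c, p_u):
    """E[B] = (E[PrW] - 0.5) * p_c + p_u (equation 4.59)."""
    return (expected_predicate_wait(p_mtc) - 0.5) * p_c + p_u

def conventional_computation_length(ier_prime, p_c, p_u):
    """E[T] = 1 / (IER' * p_b) with p_b = p_u + p_c (equation 4.60)."""
    return 1.0 / (ier_prime * (p_u + p_c))

# Hypothetical program structure: CC microoperations take 1 or 3 cycles.
e_prw = expected_predicate_wait({1: 0.5, 3: 0.5})
e_b = conventional_branch_wait({1: 0.5, 3: 0.5}, p_c=0.1, p_u=0.05)
e_t = conventional_computation_length(2.0, p_c=0.1, p_u=0.15)
```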


For the conventional microprogrammable control unit, we can
provide a more precise treatment in obtaining the probability
distribution for W, given W>0. Blocks, when they are activated,
have been prefetched (conceptually) in their entirety. They then
pass through the Window at the rate IER' microoperations/cycle.
Thus window position i will be occupied for a time (j-i+1)/IER'
if the block length were j microoperations, ($j \ge i$). Thus

$$
\begin{aligned}
P[W=i] &= K \sum_{j=i}^{\infty} \frac{j-i+1}{\mathrm{IER}'}\;
          p_b (1-p_b)^{j-1} \\
       &= K\, (1-p_b)^{i-1} / (p_b\, \mathrm{IER}')
\end{aligned}
\tag{4.62}
$$

where

$$ K = \left( \sum_{j=1}^{\infty} (1-p_b)^{j-1} / (p_b\, \mathrm{IER}') \right)^{-1}
     = p_b^2\, \mathrm{IER}' $$

Thus

$$ P[W=i] = p_b (1-p_b)^{i-1} \tag{4.63} $$

This is the original distribution of the block length of the
program structure used on the microprogrammable control unit.
(Equation 4.63 can be obtained directly using the memoryless
property of geometric distributions.) Note the limiting case for
the adaptive processor as $E[W_0] \to 1/p_c$: if $E[c_r]=0$ in
equation 4.56, equation 4.55 describing P[W=i] for the adaptive
processor is the direct counterpart of equation 4.63.

A direct comparison between equations 4.59 and 4.26, and 4.60
and 4.27 shows that branch waits on a conventional
microprogrammable control unit are longer than for an adaptive
processor under equivalent operating conditions. These operating
conditions are the same $p_c$ and the same IER' for both
processors. This difference is due to

1. A longer effective block length for the adaptive
processor, $1/p_c$ vs $1/(p_u + p_c)$.

2. Limited disruption due to unconditional branches - they
are resolved in the Converger.

3. The adaptive processor can exploit residual
microoperations if any exist.
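The step from equation 4.62 to equation 4.63 in this section can be checked numerically by summing the occupancy times directly and applying the normalizing constant K. The sketch below is illustrative; the function name and the truncation point `j_max` of the infinite sum are assumptions.

```python
def occupancy_weights(p_b, ier_prime, positions, j_max=5000):
    """Unnormalized occupancy time of Window position i (sum in 4.62):
    sum over block lengths j >= i of (j - i + 1)/IER' * p_b * (1 - p_b)**(j - 1).
    The infinite sum is truncated at j_max (an assumption of this sketch)."""
    weights = []
    for i in range(1, positions + 1):
        w = sum((j - i + 1) / ier_prime * p_b * (1.0 - p_b) ** (j - 1)
                for j in range(i, j_max + 1))
        weights.append(w)
    return weights

# Hypothetical values: p_b = 0.2, IER' = 2.
p_b, ier_prime = 0.2, 2.0
w = occupancy_weights(p_b, ier_prime, 3)
K = p_b ** 2 * ier_prime          # normalizing constant from the text
window_probs = [K * x for x in w] # should follow p_b * (1 - p_b)**(i - 1)
```

Successive weights shrink by the factor 1 - p_b, which is the geometric block-length distribution of equation 4.63.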

4.5  Scan  Space  Models 

The goal of a scan space model is to determine the IER of a
basic processor. The incipient viewpoint of the modeling process
by the scan space model differs from that of the issue delay
model. The scan space model derives a probability distribution
for the random variable I, the number of microoperations issued
each scan cycle. The IER is obtained from E[I], the expectation
of I.

The basis of the following processor performance models is
derived from the microprogramming process. A microprogrammer,
when composing a microinstruction, scans a conceptual Window of
candidate microoperations, selecting some for membership,
bypassing others, until composition of the current
microinstruction has been terminated. The ones which have been
bypassed are delayed and reconsidered for composition in the next
microinstruction. We characterize this microprogramming process
with a Scan Space, S(I,D,T), where

I = the number of microoperations issued to the current
microinstruction.

D = the number of microoperations delayed while composing
the current microinstruction.

T = a binary variable indicating if the state is
transient or terminal.

A scan is the process that examines the microoperations in the
Window and composes a microinstruction.

For the adaptive processor, microinstruction composition is
performed in real-time and the scan cycle has a well defined
period. On conventional microprogrammable computers, composition
is not restricted by real-time.

During  a  particular  scan,  only  one  of  the  state  variables  can 
be  incremented  during  transition.  Thus  the  scanning  process  is  a 
birth  process  [KLE75],  with  the  process  terminating  when  a 
terminal  state  is  entered.  A  terminal  state  is  reached  when  the 
scan  process  overreaches  the  physical  capabilities  of  the 
machine. 

The  extent  of  the  scan  space  is  limited  by  the  physical 
resources  of  the  machine.  The  variable  I  can  be  limited,  for 
example,  by  the  number  of  control  fields  in  a  microinstruction 
format,  by  the  transmission  rate  of  the  Window  in  a  lookahead 
processor,  or  by  min  {issue  rate,  number  of  function  units}  in  an 
adaptive processor. The extent of a possible scan space in an
I-D plane is shown in figure 4.9. Further logical constraints from
the program structure are imposed on the scan space. The
conditional branch reduces the effective range of the Window, and
consequently, the effective extent of the scan space. The
limits of D and I are [0,DM(i)] and [0,IM(d)], respectively,

where DM(i) = maximum D for I=i and

IM(d) = maximum I for D=d.

Figure 4.9 Scan Space Extent

In the derivation of a scan space model, the goal will be to
determine the terminal scan state probability distribution
function, P[(I,D,1)]. The IER is then obtained from

IER»=  E[Il/(1-p^)  (4.64) 

Note that IER', not IER, is derived from the scan space model.
The scan space model does not include the effect of branch waits
nor does it directly include the issuance of branches. Branch
waits are included into IER using equation 4.25. The issuance of
branches is included in IER' by adjusting the mean number of
microoperations issued, as shown in equation 4.64.

In the analysis, we derive state transition probabilities
from each transient state into one of three mutually exclusive
directions (fig. 4.10). Let

PI[I,D] = issue transition probability from state (I,D,0)

PD[I,D] = delay transition probability from state (I,D,0)

PT[I,D] = termination probability from state (I,D,0)

subject to

$$ PI[I,D] + PD[I,D] + PT[I,D] = 1 \tag{4.65} $$

These probabilities will be obtained by defining probabilistic
events on the scan state process and then deriving their
probabilities in terms of model inputs. For each of the models
that follow, we will derive expressions for PI[I,D] and PT[I,D].
PD[I,D] may then be obtained using equation 4.65. To simplify
formulas that follow, let

$$ P[(I,D,T)] = PI[I,D] = PD[I,D] = 0 \tag{4.66} $$

for all I,D not within the extent of the scan space.

Figure 4.10 Scan Space Transitions

The relationship used to calculate the terminal state
probabilities is:

$$
\begin{aligned}
P[(I,D,1)] &= PT[I,D]\; P[(I,D,0)] \\
           &= PT[I,D]\, \big( PI[I-1,D]\; P[(I-1,D,0)]
              + PD[I,D-1]\; P[(I,D-1,0)] \big)
\end{aligned}
\tag{4.67}
$$

The above is a recurrence relation subject to equation 4.65, and
by assigning state (0,0,0) a convenient value, a probability
weight may be calculated for each terminal state (I,D,1) in the
extent of the scan space. Normalization will then give
P[(I,D,1)], allowing the calculation of the IER'. Unfortunately,
however, because the transition probabilities are functions of
the IER' and because of the complexity of the models, a closed
form solution for the IER' does not seem possible. This forces
an iterative procedure on the computation of IER'.
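The recurrence of equation 4.67 can be sketched as a sweep over the I-D plane followed by normalization. The code below is a minimal illustration, not the thesis's program: the transition functions `pi` and `pt` are caller-supplied placeholders (the real ones depend on the model and on IER'), the scan space is taken to be a simple rectangle, and any probability that would leave that rectangle is assumed to have been folded into termination by the caller.

```python
def terminal_distribution(pi, pt, i_max, d_max):
    """Normalized P[(I, D, 1)] over a rectangular scan space.

    pi(i, d), pt(i, d): issue and termination probabilities from the
    transient state (i, d, 0); the delay probability follows from 4.65.
    """
    def inside(i, d):
        return 0 <= i <= i_max and 0 <= d <= d_max

    def pd(i, d):                      # equation 4.65
        return 1.0 - pi(i, d) - pt(i, d)

    transient = {(0, 0): 1.0}          # convenient weight for state (0,0,0)
    for i in range(i_max + 1):
        for d in range(d_max + 1):
            if (i, d) == (0, 0):
                continue
            weight = 0.0
            if inside(i - 1, d):       # reached by issuing one more
                weight += pi(i - 1, d) * transient[(i - 1, d)]
            if inside(i, d - 1):       # reached by delaying one more
                weight += pd(i, d - 1) * transient[(i, d - 1)]
            transient[(i, d)] = weight

    terminal = {state: pt(*state) * w for state, w in transient.items()}
    total = sum(terminal.values())
    return {state: w / total for state, w in terminal.items()}

def expected_issue(terminal):
    """E[I] under the terminal state distribution."""
    return sum(i * w for (i, _d), w in terminal.items())

# Toy transition functions, purely illustrative: no delays, at most 2 issues.
def toy_pi(i, d):
    return 0.5 if i < 2 else 0.0

def toy_pt(i, d):
    return 0.5 if i < 2 else 1.0

toy_terminal = terminal_distribution(toy_pi, toy_pt, i_max=2, d_max=0)
toy_mean_issue = expected_issue(toy_terminal)
```

In a full model the transition functions would themselves depend on IER', so this sweep would sit inside the iterative procedure the text describes, repeated until the computed IER' stops changing.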

To facilitate the description of the model analyses, the
following definitions are helpful. We define a system state,
S=(X,D,T,Y,R,W), to describe the status of the system components
while microoperations are being examined for issuance from the
Window. The first three components have already been introduced
and they define the scan state. Further description of the
components follows:


X = $(x_1, x_2, \ldots, x_F)$ is the composition state. It is a
refinement of the state variable, I, defined before. Component
$x_f$, $1 \le f \le F$, $0 \le x_f \le n_f$, specifies the number of
f-microoperations that await initiation at the beginning of the
next execution cycle. F is the number of types of function units.
The value $n_f$ is the maximum number of f-microoperations that
can await initiation. We will use the convention that $x_1$
designates the conditional branch microoperation.

D = the delay state. It specifies the number of
microoperations whose issuance is delayed at a given point in a
scan cycle.

T = the scan flag. T=0 indicates that the window scan is
continuing. T=1 indicates that the scan has terminated.


Y = $(y_1, y_2, \ldots, y_N)$ is the computation state of the
operational unit. Each component, $y_i$, specifies the remaining
execution time in control cycles of the i'th function unit. The
$y_i$'s are always specified relative to the beginning of the next
execution cycle. N is the total number of function units in the
operational unit.

R = the register state. This is the number of occupied registers
that cannot be specified as a destination by issuing
microoperations.

W = the window state. The window state specifies the
availability of microoperations for issuing in a given scan
cycle. Its probability distribution has been analyzed in the
previous section.

We say that the current composition state and computation
state are f-extensible if

1. $x_f < n_f$,

2. $x_1 = 0$, and

3. an f-function unit will be available in the next cycle.

These conditions specify the f-resource status that must be
satisfied for an f-microoperation to be issued in the current
scan cycle.
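
The three conditions can be read as a simple predicate. The following is a minimal sketch, not part of the original model: the function name, the list layout (index 0 holds type 1, the branch type), and the boolean flag standing in for the computation-state check are all assumptions made for illustration.

```python
from typing import Sequence


def is_f_extensible(x: Sequence[int], n: Sequence[int], f: int,
                    unit_free_next_cycle: bool) -> bool:
    """Check the three f-extensibility conditions for 1-based type f.

    x[k] and n[k] hold x_{k+1} and n_{k+1} (hypothetical layout).
    unit_free_next_cycle summarizes the computation state (condition 3).
    """
    has_free_field = x[f - 1] < n[f - 1]   # condition 1: x_f < n_f
    no_branch_issued = x[0] == 0           # condition 2: x_1 = 0
    return has_free_field and no_branch_issued and unit_free_next_cycle


# Hypothetical machine with F = 3 types and field counts n = (1, 2, 1).
n = [1, 2, 1]
print(is_f_extensible([0, 1, 0], n, f=2, unit_free_next_cycle=True))
print(is_f_extensible([0, 2, 0], n, f=2, unit_free_next_cycle=True))
```

The second call fails condition 1 because both type-2 fields are already occupied.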


In  the  discussion  below,  we  concentrate  on  the  development  of 
a  basic  processor  model  for  conventional  microprocessors.  This 
description  will  be  rather  detailed.  The  descriptions  of 
variants  of  basic  processor  models  will  be  briefer,  using  the 
following  model  as  a  point  of  departure. 


4.6 Conventional Microprogrammable Computers

A model for a maximal microprogrammable computer is derived.
Recall that a maximal microinstruction format has a separate
control field for each function unit in the operational unit. The
equations  for  the  vertical  and  horizontal  microprogrammable 
processors,  which  are  restrictions  of  the  maximal  machine,  will 
also  be  derived.  The  analysis  follows  a  discussion  of  the 
machine  configuration  and  control  policy  which  describe  this 
processor  class. 

4.6.1 Machine Configuration


The  machine  configuration  specifies  the  operational  unit 
characteristics,  and  the  control  unit  characteristics.  The 
operational  unit  characteristics  are  specified  by  the  number  of 
work  registers,  by  a  function  unit  vector  giving  the  function 
unit types used and their multiplicity, and by a microoperation-
function unit map which specifies the execution time and the
function unit of each microoperation. For microprogrammable
processors, the function unit vector is identical, in its
specification  of  the  number  and  types  of  function  units,  to  the 
maximal  microinstruction  format. 


The  control  unit  characteristics  include  the  parameters  of 
the  Stream  Controller,  the  disposition  of  memory  systems  with 
respect  to  operations,  data,  and  external  or  I/O  processes  and 
the  number  and  types  of  control  fields. 

The number and types of control fields used by the control
unit are specified by the set of permissible microinstruction
formats used by the machine. The control fields correspond to
the microoperation fields of the microinstruction register and
have the restriction that their contents must all be initiated in
the same execution cycle. Further restrictions are that the
number of f-control fields specified by a microinstruction format
does not exceed the number of f-function units in the operational
unit and that no buffering is provided for source operands at the
function units. These restrictions deprive the microprogrammable
processor of the ability to automatically detect and exploit
parallelism. The detection and exploitation of parallelism
is instead provided by the microprogrammer, who compiles the
microprogram before its execution.

The  conceptual  Window  that  the  microprogrammer  uses  when 
composing  microinstructions  has  parameters  that  are  restricted 
only  by  the  microinstruction  formats.  The  Window  width  and 
Issuer  scan  rate  are  infinite.  However,  the  number  of 
microoperations  issued  is  limited  to  the  microinstruction  format 
degree.  In  addition  the  issued  microoperations  must  be  contained 
in  the  same  block  of  the  microprogram. 


4.6.2 The Control Policy

This  control  policy  for  microprogrammable  processors  is 
discussed  by  reference  to  the  significant  features  of  the 
microprogramming  process. 

The conceptual provision policy for the microprogrammable
processor has been described in section 4.4, where the probability
of Window occupancy has been derived. Below is the issue policy
that must be satisfied for the issue of an f-microoperation for a
given machine state. It comprises the data and stream
hazard avoidance conditions described in section 3.6.1, augmented
by conditions specifying resource demands.

ISSUE: When

(1) there exists an f-microoperation, m, in the current window position, and

(2) the current composition state and the current execution state are f-extensible, and

(3) $S(m) \cap D(\{E \cup I_A \cup Q_A\}) = \emptyset$, and

(4) (i) $D(m) \cap D(\{E \cup I_A \cup Q_A\}) = \emptyset$, and

    (ii) $D(m) \cap S(Q_A) = \emptyset$, and

(5) if m is a branch, there can be no delayed microoperations preceding it.

If  all  of  the  above  conditions  are  satisfied,  microoperation  m  is 
issued,  or  in  the  present  context,  made  a  member  of  the 
microinstruction  being  currently  composed.  If  the  conditions  are 
not all satisfied, m is delayed and reconsidered in the next scan
cycle.

Condition  (1)  establishes  the  existence  of  m.  Its 
probability  is  determined  jointly  by  the  probability  distribution 
for W derived in the previous section and by the empirical
microoperation occurrence probability distribution 0 (section
4.1.4). Condition (2) expresses the need for an empty control
field  in  the  current  microinstruction  and  a  free  f-function  unit 
at  the  beginning  of  the  next  execution  cycle.  Only  the  former 
requirement  is  essential.  The  latter  is  required  to  avoid  an 
initiation  delay  to  the  current  microinstruction  and  a  potential 
data  hazard.  This  tends  to  increase  performance  at  a  possible 
space cost to control memory. Condition (2), through
f-extensibility, prohibits code motion around branches while a
microinstruction is composed. This is a fact of life for
microprogrammers working in a static environment, which
effectively limits the Window contents to the microoperations in
a single block. The probability that condition (2) is satisfied
will  be  determined  by  probabilities  associated  with  the 
composition  and  computation  states.  Condition  (3)  specifies  that 
the  sources  of  m  must  be  available  at  the  beginning  of  the  next 
execution cycle. Condition (4) expresses the need for a
destination  register  for  register  value  creating  microoperations. 
As  the  register  assignments  can  be  made  using  an  infinite  window, 
register  assignments  can  be  reasonably  optimal.  Because  memory 
store  microoperations  do  not  require  destination  registers,  they 
are exempt from condition (4). Condition (5) specifies the fact
that  in  a  conventional  microprogram,  the  branch  is  in  the  last 
microinstruction  to  be  initiated  in  a  block.  The  effect  of 
condition  (5)  will  be  included  in  the  treatment  of  condition  (1). 

The  conditions  above  form  a  basis  for  sets  of  events  whose 
probabilities  will  be  determined.  If  all  of  the  conditions  are 
satisfied,  microoperation  m  will  be  issued  and  an  issue 
transition  in  the  scan  space  occurs.  If  they  are  not  all 
satisfied,  microoperation  m  is  delayed  and  the  scan  either 
continues  or  terminates.  These  transitions  and  their  events  will 
be  analyzed  in  following  sections. 

4.7 Scan State Transition Probabilities

From  the  description  of  the  microprogramming  process,  it  is 
seen  that  the  scan  space  is  limited  in  the  I  direction  by  the 
maximum  microinstruction  format  degree  and  is  unlimited  in  the  D 
direction  because  the  scan  rate  is  infinite.  In  practice  D  does 
have an effective upper bound because microprograms are finite
and the occurrence probability of conditional branch
microoperations  effectively  reduces  the  event  probability  of  D 
with  geometric  rapidity. 

In  the  following,  we  specify  three  mutually  exclusive  sets  of 
events  which  must  be  satisfied  for  transitions  out  of  scan  state 
(I,D,0).  These  sets  of  events  correspond  to  Issue,  Delay,  and 
Termination  transitions.  They  are  formulated  from  the  statement 
of  the  control  policy  and  the  specifications  of  the  machine 
configuration.  The  event  sets  for  issue  transition  and 
termination  transition  will  be  explicitly  stated  and  the 
corresponding  probabilities  will  be  derived.  The  delay 
transition  probability  will  then  be  derived  using  equation  4.65. 

Each  event  is  provided  with  a  name  that  will  be  used  to 
denote  the  probability  density  function  of  that  event.  The 
following  is  a  derivation  for  the  issue  transition  probability. 


For an f-microoperation to be issued when the scan process is
in state (I,D,0), all of the following events must occur:

1. W(I+D): {window position I+D must contain an active microoperation}

2. P[Mf(f)]: {the issuing microoperation is an f-microoperation | 1}

3. EXT(I,f): {the composition state and computation state must be f-extensible | 1,2}

4. NOD(I+D,f): {there are no delaying operand dependencies from position I+D | 1,2,3}

5. R(I,D,f): {there will be a destination register available | 1,2,3,4}

Note that the events given in the above order have been
conditioned on the occurrence of the previous events. Not all of
the conditioning is significant; this will be discussed in the
following sections. The first two conditions establish the
existence of an f-microoperation in Window position I+D. The
next three specify the requirements for issue.

Each  f-class  of  microoperations  has  a  separate  set  of  issue 
conditions. 

The issue transition probability out of state (I,D,0) is the
sum of the probability contributions from all f-types. Using the
function descriptors defined for the events above,

$$PI[I,D] = W(I+D) \sum_{f=2}^{F} P[Mf(f)]\, EXT(I,f)\, NOD(I+D,f)\, R(I,D,f) \qquad (4.68)$$

The  derivation  of  the  expressions  for  the  terms  of  equation 
(4.68)  follows. 

W(I+D)

The  Window  occupancy  distribution  has  been  derived  in  section 
4.2.  In  the  current  context,  we  assume  the  Window  is  not  empty. 
Equation  4.63  giving  P[W=i]  was  derived  using  block  lengths  that 
included  the  terminating  branch  microoperations.  Because  we 
include  the  effect  of  branch  microoperation  issue  through 
equation  4.64,  the  branch  microoperation  in  the  Window  denotes 
the  first  empty  Window  position  as  far  as  the  Issuer  is 
concerned. Thus the probability in equation 4.63 is adjusted to
omit the occurrence of the delimiting branch.

W (I+D) = (I-P5)  (4.69)

where  Pt  =  ^  Pc 

P[Mf(f)]

This is the probability that an f-microoperation occurs. Because
we do not include branches, P[Mf(f)] must be conditioned by this
fact. Henceforth

$$p_f = \frac{P[Mf(f)]}{1 - p_1} \quad \text{for } 2 \le f \le F \qquad (4.70)$$

will be used as the probability that an f-microoperation occurs
given that a branch does not occur.

EXT(I,f)

From the specification of the execution policy, the
f-extensibility of a scan state requires the availability of an
f-function unit in the next execution cycle and an f-control field
in the current microinstruction. As the maximal processor has a
1-1 correspondence between function units and control fields,
composition state extensibility and computation state
extensibility are not independent. We begin by obtaining a
probability density function for the number of f-function units,
$Y_f$, that will remain busy in the next execution cycle.

An exact analysis of the distribution for $Y_f$ would require a
conditioning on the microinstructions initiated in previous
cycles, as well as the unknown distribution for I. Instead, an
approximate analysis based on Little's result [KLE75] will be
used. Several preliminary quantities must first be derived.
These are

EET(f), the expected execution time of an f-function unit
executing the given microoperation mix, and

$E[Y_f]$, the expected number of f-function units to remain
busy in the next control cycle.

EET(f) is the mean execution time of an f-function unit,
weighted by the f occurrence probabilities. Thus

$$EET(f) = \frac{1}{P[Mf(f)]} \sum_{t=1}^{T_m} t\, P[Mft(f,t)] \qquad (4.71)$$

As the mean arrival rate of f-microoperations is
$IER \cdot P[Mf(f)]$, the mean number of busy f-function units, using
Little's result, is

$$EET(f) \cdot IER \cdot P[Mf(f)]$$

As the mean departure rate of f-microoperations in equilibrium
equals the mean arrival rate,

$$E[Y_f] = (EET(f) - 1) \cdot IER \cdot P[Mf(f)] \qquad (4.72)$$

Assuming that the f-function units are assigned randomly to
incoming microoperations,

$$P[\text{an f-function unit is busy at the beginning of an execution cycle}] = E[Y_f]/n_f \qquad (4.73)$$

Further, assuming that the busyness of an f-function unit is
independent of the busyness of other f-function units, $P[Y_f = i]$ is
a binomial distribution given by

$$P[Y_f = i] = \binom{n_f}{i} p^i (1-p)^{n_f - i} \qquad (4.74)$$

where $p = E[Y_f]/n_f$.
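
The chain from equation 4.71 to equation 4.74 can be traced numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the execution-time distribution and the IER value are made-up inputs, and the busy probability p is assumed not to exceed 1.

```python
from math import comb


def expected_execution_time(p_ft):
    """EET(f) as in (4.71); p_ft maps execution time t to P[Mft(f,t)]."""
    p_f = sum(p_ft.values())                      # P[Mf(f)]
    return sum(t * p for t, p in p_ft.items()) / p_f


def expected_busy_units(p_ft, ier):
    """E[Y_f] as in (4.72): units still busy at the start of the next cycle."""
    p_f = sum(p_ft.values())
    return (expected_execution_time(p_ft) - 1.0) * ier * p_f


def busy_distribution(p_ft, ier, n_f):
    """P[Y_f = i] for i = 0..n_f as in (4.73) and (4.74)."""
    p = expected_busy_units(p_ft, ier) / n_f      # per-unit busy probability
    return [comb(n_f, i) * p**i * (1 - p)**(n_f - i) for i in range(n_f + 1)]


# Hypothetical mix: f-microoperations take 1 or 3 cycles, each with probability 0.10.
p_ft = {1: 0.10, 3: 0.10}
dist = busy_distribution(p_ft, ier=1.5, n_f=2)
print(expected_execution_time(p_ft), expected_busy_units(p_ft, 1.5), dist)
```

With these inputs EET(f) is 2 cycles and E[Y_f] is 0.3, so each of the two units is busy with probability 0.15.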


The f-extensibility of the composition state is dependent on
the composition state degree, the number of f-microoperations
already issued, and on the number of busy f-function units as
determined from the computation state. We first derive the
probability of a composition state assuming no dependency on the
computation state. A coupling factor will be derived afterward.

The composition state probabilities can be characterized by
the following simple combinatorial model. We have an infinite
urn with balls of F different colours in the proportion
$p_1, p_2, \ldots, p_F$, and a box with N slots, $n_f$ of colour f, $1 \le f \le F$, such
that $n_1 + n_2 + \cdots + n_F = N$. Balls are picked at random from the urn,
with balls of colour f being placed into one of the slots of colour f. If
none of the $n_f$ slots is empty, the f-ball is returned to the urn.
The problem is to find the probability that a ball of colour f
can be accepted by the box given that the box contains I balls.
The computational difficulty with this problem is that the
probability of acceptance is conditioned by the current state of
the box. For example, if for some colour i, all $n_i$ slots are
full, then the probabilities of selection for balls of colour f
are changed to $p_f/(1 - p_i)$. Thus determining f-extensibility given
I will in general require the determination of the probabilities
of each composition state. This would be most efficiently done
by following a branching process, determining the probabilities
of all permissible states at level I from those at level I-1. As
these probabilities are constant through the IER iteration
process, these computations could be undertaken for a reasonable
number of states.

In  the  analysis,  we  approximate  composition  state 
probabilities  by  a  restricted  multinomial  distribution.  This 
model  arises  if  we  consider  constant  selection  probabilities. 
This  characterization  would  be  precise  if  microoperation  issuing 
ceases  after  the  first  delayed  microoperation.  The  restriction 
occurs because not all combinations of microoperations are
permissible  in  the  microinstructions.  First  we  introduce  some 
definitions. 

Let $X = (x_1, x_2, \ldots, x_F)$ be the composition state and
$Z = (z_1, z_2, \ldots, z_F)$ be a microinstruction format specification.
Then the set of permissible microinstructions, MI, with respect
to Z is defined by

$$MI = \{X \mid x_f \le z_f \text{ for all } f \text{ and } x_1 = 0\} \qquad (4.75)$$

A permissible microinstruction cannot specify more
f-microoperations than the number of f-fields provided. Also, we
do not include the occurrence of branch microoperations. From
the definition of f-extensibility, the set of f-extensible
microinstructions of degree I, MIX(I,f), is

$$MIX(I,f) = \{X \mid X \in MI,\ |X| = I,\ x_1 = 0,\ x_f < n_f\}.$$

To compactify future equations, we use the following to denote a
multinomial coefficient:

$$(I\ C\ X) = \binom{I}{x_2, x_3, \ldots, x_F} = \frac{I!}{x_2!\, x_3! \cdots x_F!} \qquad (4.76)$$


Given an f-extensible composition state
$X = (x_2, x_3, \ldots, x_F)$ with $|X| = I$ and $x_f < n_f$, its occurrence
probability is

$$P[X \mid I, f\text{-extensible}] = K_I\,(I\ C\ X)\, p_2^{x_2} \cdots p_F^{x_F} \qquad (4.77)$$


where $K_I$ is the normalization constant. Recall that in our
function unit naming convention, the 1-function unit always
executes branch microoperations and that we do not include branch
issues. CX(I,f), the probability that the current composition
state with degree I is f-extensible, is obtained by summing the
probabilities for all $X \in MIX(I,f)$:

$$CX(I,f) = \sum_{X \in MIX(I,f)} K_I\,(I\ C\ X)\, p_2^{x_2} \cdots p_f^{x_f} \cdots p_F^{x_F} \qquad (4.78)$$

The normalization constant, $K_I$, is the inverse of the sum of the
probabilities of all permissible microinstructions with degree I,
i.e., the sum over all $X \in MI(I) = \{X \mid X \in MI,\ |X| = I\}$:

$$K_I = \left[ \sum_{X \in MI(I)} (I\ C\ X)\, p_2^{x_2} \cdots p_f^{x_f} \cdots p_F^{x_F} \right]^{-1} \qquad (4.79)$$

To compute CX(I,f) for a given microoperation mix and
microinstruction format specification, the storage requirements
are O(N), where N is the maximum format degree, and the time
requirements are proportional to the number of states, $O(2^N)$.
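
The enumeration behind equations 4.78 and 4.79 can be sketched by brute force. This is an illustrative sketch rather than the procedure used in the analysis: the function names are hypothetical, the branch type is omitted from the vectors (so index 0 holds type 2), and a maximal format is assumed so that $n_f = z_f$.

```python
from itertools import product
from math import factorial, prod


def multinomial(xs):
    """(I C X) = I! / (x_2! ... x_F!) with I = sum(xs), as in (4.76)."""
    return factorial(sum(xs)) // prod(factorial(x) for x in xs)


def cx(degree, f, z, p):
    """CX(I,f) as in (4.78) and (4.79), by enumerating all states of degree I.

    z[k] and p[k] describe non-branch type k+2; f is the 1-based type, f >= 2.
    """
    total = 0.0        # sum over MI(I), i.e. 1 / K_I
    extensible = 0.0   # sum over MIX(I,f)
    for xs in product(*(range(zk + 1) for zk in z)):
        if sum(xs) != degree:
            continue
        weight = multinomial(xs) * prod(pk**xk for pk, xk in zip(p, xs))
        total += weight
        if xs[f - 2] < z[f - 2]:       # x_f < n_f, with n_f = z_f assumed
            extensible += weight
    return extensible / total if total else 0.0


# Hypothetical format: two type-2 fields, one type-3 field.
print(cx(degree=2, f=2, z=[2, 1], p=[0.6, 0.4]))
```

The loop visits every vector bounded by z, which is why the running time grows with the number of states.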

The excessive computation time for CX(I,f) is due to the
detailed representation of the composition state X. Computation
can be reduced by distinguishing between only two classes of
microoperations: f or non-f. This approximation of the
composition states suggests the use of a binomial distribution to
approximate CX(I,f). For the set of composition states
represented by I, the probability that i f-fields are occupied is

$$P[x_f = i \mid I] = \binom{I}{i} p_f^{\,i} (1 - p_f)^{I - i} \qquad (4.80)$$


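For $I \ge n_f$, the truncated binomial must be used because the number
of f-microoperation occurrences cannot exceed $n_f$. Thus

$$CX(I,f) = \begin{cases}
1, & I < n_f \\[4pt]
1 - \dfrac{P[x_f = n_f \mid I]}{\sum_{i=0}^{n_f} P[x_f = i \mid I]}, & I \ge n_f,\ I_0 \le 0 \\[8pt]
\dfrac{\sum_{i=I_0}^{n_f - 1} P[x_f = i \mid I]}{\sum_{i=I_0}^{n_f} P[x_f = i \mid I]}, & 0 < I_0 < n_f \\[8pt]
0, & I = N
\end{cases} \qquad (4.81)$$

where $I_0 = I + n_f - N$ is the minimum number of f-microoperations
in the current composition state.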
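
A truncated binomial of this kind can be evaluated with one uniform rule instead of separate cases. The sketch below is an assumption-laden illustration, not the author's formula verbatim: it assumes that $x_f$ is restricted to the range from max(0, I + n_f - N) to min(I, n_f), that N is the maximum format degree, that I does not exceed N, and that 0 < p_f < 1. The function name is hypothetical.

```python
from math import comb


def cx_truncated_binomial(degree, n_f, n_total, p_f):
    """Approximate CX(I,f) = P[x_f < n_f | I] under a truncated binomial.

    x_f is confined to lo..hi, where lo is the fewest f-microoperations that
    a state of this degree can hold and hi is the most.
    """
    lo = max(0, degree + n_f - n_total)
    hi = min(degree, n_f)
    pmf = {i: comb(degree, i) * p_f**i * (1 - p_f)**(degree - i)
           for i in range(lo, hi + 1)}
    norm = sum(pmf.values())                       # truncation renormalizes
    return sum(v for i, v in pmf.items() if i < n_f) / norm


# Hypothetical machine: N = 4 fields, two of them f-fields, p_f = 0.5.
for degree in range(5):
    print(degree, cx_truncated_binomial(degree, n_f=2, n_total=4, p_f=0.5))
```

For degrees below n_f the result is 1, and at degree N every field is full so the result is 0.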


The dependency of the composition state on the computation
state imposes the constraint $x_f < n_f - Y_f$, necessitating a joint
probability density function to describe f-extensibility. Using
this constraint, EXT(I,f), the probability that the computation
and composition states are f-extensible given that I
microoperations have been issued, is

$$EXT(I,f) = \sum_{j=0}^{n_f - 1} P[x_f = j \mid I] \left\{ \sum_{i=0}^{n_f - j - 1} P[Y_f = i] \right\} \qquad (4.82)$$

The braced term in equation (4.82) weights the probability of a
composition state specifying j f-microoperations by the
probability that the computation state will accommodate
additional executing f-function units. We can significantly
simplify equation (4.82) by weighting the composition state
probabilities with the mean of the weights. Let $b_f$ be this mean.
Then

$$\begin{aligned}
b_f &= \frac{1}{n_f} \sum_{j=0}^{n_f - 1} \sum_{i=0}^{n_f - j - 1} P[Y_f = i] \\
    &= \frac{1}{n_f} \sum_{i=0}^{n_f} (n_f - i)\, P[Y_f = i] \\
    &= \frac{1}{n_f} \left( n_f - E[Y_f] \right) \\
    &= 1 - E[Y_f]/n_f \qquad (4.83)
\end{aligned}$$

where $E[Y_f]$ is defined by equation (4.72). Observe that $b_f$ is
the probability that a specific f-function unit will be idle at
the beginning of the cycle. This interpretation of $b_f$ is based
on the assumptions leading to equation 4.73. Using $b_f$ and
CX(I,f),

$$EXT(I,f) = b_f\, CX(I,f) \qquad (4.84)$$
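
The simplification in equation 4.83 can be checked numerically. The sketch below assumes the binomial busy-unit distribution of equation 4.74 and averages the weight over j = 0..n_f-1; the function names are hypothetical and the numeric inputs are made up for illustration.

```python
from math import comb


def binomial_pmf(n, p):
    """P[Y = i] for i = 0..n under a binomial(n, p) distribution."""
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]


def mean_weight(n_f, p_busy):
    """Mean over j = 0..n_f-1 of P[Y_f <= n_f - j - 1], the weight in (4.83)."""
    y = binomial_pmf(n_f, p_busy)
    return sum(sum(y[: n_f - j]) for j in range(n_f)) / n_f


def ext(cx_value, expected_busy, n_f):
    """EXT(I,f) = b_f * CX(I,f) with b_f = 1 - E[Y_f]/n_f, as in (4.84)."""
    return (1.0 - expected_busy / n_f) * cx_value


# With E[Y_f] = n_f * p, the double sum collapses to 1 - p.
print(mean_weight(3, 0.2), 1 - 0.2)
print(ext(cx_value=0.5, expected_busy=0.6, n_f=3))
```

The first line prints two matching values, which is the content of the closed form for $b_f$.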


NOD(I+D,f) 

The operand delay experienced by an issuing microoperation is
caused by a preceding microoperation in one of three states.
These are the computation, composition, or delay states. Let
A, B, and C designate the events that the source of an issuing
microoperation must wait for the result of a preceding
microoperation. The correspondence of the events to the
microoperation state is:

A - an executing microoperation,

B - an issued microoperation, or

C - a delayed microoperation.

We will derive $P[A \vee B \vee C \mid I,D]$ using the operand dependency
distribution $o_j$. NOD(I+D,f) can then be easily determined.
Although $o_j$ has a different distribution depending on whether the
f-microoperation is monadic or dyadic (equations 4.4 and 4.5), we
do not make a distinction until the formula for NOD(I+D,f) is
stated.

We  assume  that  the  stream  order,  issue  order,  and  initiation 
order of the microoperations are identical. This assumption
permits  us  to  use  the  same  operand  dependency  distribution,  o^ , 
for  issued,  delayed,  and  executing  microoperations.  In  general, 
this  assumption  is  not  true  because  scrambling  of  the 
microoperation  stream  order  in  the  Window  during  microinstruction 
composition is inevitable. It is this property which allows
execution ready microoperations to bypass those which are
delayed. Furthermore, this scrambling produces a different
operand dependency distribution for each window position,
approaching the original distribution with increasing window position. The
effect of this assumption on IER will be described qualitatively.

Assume that the model has reached equilibrium. Thus at the
beginning of each scan cycle, examination of microoperations for
issuing begins with the delayed microoperations, followed by the
microoperation stream. If the first microoperation examined was
delayed in the previous scan cycle, there can be no operand delay
from the issued microoperations that bypassed it. This follows
from the local independence relation that a pair of
microoperations must satisfy if their order interchange is to be
permitted. Consequently, the delayed microoperation's operand
dependency distribution, if conditioned by b bypassing
microoperations, will have b zero initial entries. This effect
would increase the issue probability of the delayed
microoperation over a microoperation which has not been delayed
and bypassed. As the number of bypassing microoperations for a
particular scan depends on the path taken through the scan space,
an exact analysis of transformed operand dependencies would
require the examination of all possible paths. In a scan space
rectangle measuring [i,d], there are $\binom{i+d}{i}$ paths, making a
detailed analysis computationally formidable. Path dependency is
illustrated in figure 4.11 in which paths P and Q terminate at
state (i,d). In path P, four microoperations bypass the first
three, while in path Q, no microoperations are bypassed.


Figure 4.11 Paths with different numbers of bypassed microoperations


The derivation of $P[B \vee C]$ is relatively straightforward and
will be given first. Any register value creating microoperation
in the current microinstruction or in the delay queue can
contribute to operand delay.

Assuming that the operand references are independently and
identically distributed for all operands, and using the operand
referencing distribution, $o_j$,

$$P[B \vee C \mid I,D] = P[Mv] \sum_{j=1}^{I+D} o_j \qquad (4.85)$$

where Mv is the set of register value creating microoperations.

Event  A  is  conditioned  by  the  computation  state  because  the 
issuing  microoperation  may  be  waiting  for  the  f-function  unit  in 
addition to waiting for the operand. We first derive P[A | I=D=0]
and  then  we  derive  a  multiplier  to  reflect  the  dependency. 

Assume that IER microoperations are initiated by each of the
previous microinstructions. Let

$T_m$ = the maximum execution time of a microoperation,

Mtv(t) = {m | m is a register value creating microoperation
having an execution time t}.

Then

$$P[A \mid I=D=0] = \sum_{t=1}^{T_m - 1} \left\{ \sum_{i=t+1}^{T_m} P[Mtv(i)] \sum_{j=\lfloor (t-1) \cdot IER + 1 \rfloor}^{\lfloor t \cdot IER \rfloor} o_j \right\} \qquad (4.86)$$

The first summation of the term inside the braces gives the
probability of occurrence of a register value creating
microoperation that can cause an operand delay from the t'th
previous microinstruction. The second term gives the probability
that a result generated by the t'th previously issued
microinstruction is referenced. Equation (4.86) can be greatly
simplified by approximating the first summation by its average.

$$P[A \mid I=D=0] = \frac{E[t_v] - 1}{T_m - 1} \sum_{j=1}^{(T_m - 1) \cdot IER} o_j \qquad (4.87)$$


where $E[t_v]$ is the mean execution time of value creating
microoperations. The effect of this approximation is to diminish
the probability of operand delay contributed by the initial
segment of the operand dependency distribution and to increase
the contribution from the trailing segment. This is the effect
of microoperation reordering by the Window. Thus the error
introduced in equation 4.85, where it was assumed that
microoperation reordering does not affect $o_j$, is partially
compensated.

The  dependency  of  A  on  the  computation  state  arises  because 
the  operand  delay  to  an  issuing  f-microoperation,  m,  may  be 
caused  by  an  f-function  unit  at  a  time  when  m  is  also  subject  to 
an  f-function  unit  wait.  This  dependency  is  illustrated  by  the 
following example. Assume that for an operational unit, only the
f-function unit requires more than one cycle for execution. Thus
only f-function units can cause both operand delays and function
unit waits. If it is known that no f-function units are
currently executing, then there can be no operand delays.
Conversely, if it is known that there is an f-function unit
executing, then the likelihood of operand delay increases. This
dependency diminishes with increasing $n_f$. The following
approximation will be used to diminish the contribution to
NOD(I+D,f) by A:

$$AF = 1 - \frac{p_f\, E[Y_f]/n_f}{1 - E[Y_f]/n_f} \qquad (4.88)$$

The  numerator  of  the  fraction  part  is  the  probability  that  the 
operand  delay  causing  microoperation  is  an  f-microoperation 
coincident  with  the  event  that  an  f-function  unit  is  busy.  The 
denominator  is  the  probability  that  some  f-function  units  will  be 
available  in  the  next  execution  cycle.  The  denominator  serves  to 
condition  the  probability  to  events  where  there  are  no  f-function 
unit waits. Thus $P[A] \cdot AF$ will represent the probability of A
given that there is no f-function unit wait. Generalizing
equation 4.87 to include this dependency,

$$P[A \mid I,D] = K \cdot AF \sum_{j=J_0}^{J_1} o_j \qquad (4.89)$$

where $K = (E[t_v] - 1)/(T_m - 1)$,

$J_0 = I + D + 1$,

$J_1 = I + D + (T_m - 1) \cdot IER$.

Combining the above results gives the probability that a source
operand is delayed as

$$P[A \vee B \vee C \mid I,D] = P[Mv] \sum_{j=1}^{I+D} o_j + K \cdot AF \sum_{j=J_0}^{J_1} o_j \qquad (4.90)$$

Equation 4.90 can be used to determine the probability that a
monadic f-microoperation will not be delayed waiting for an
operand given I, D, and f-extensibility.

$$NOD(I+D,f) = 1 - \left( P[Mv] \sum_{j=1}^{I+D} o_j + K \cdot AF \sum_{j=J_0}^{J_1} o_j \right) \qquad (4.91)$$

For dyadic microoperations, let $S_1$ and $S_2$ be the events that
operand one or operand two, respectively, are delayed. Then

$$P[S_1 \vee S_2] = 1 - P[\neg S_1 \wedge \neg S_2] = 1 - P[\neg S]^2 \qquad (4.92)$$

This  result  follows  from  the  assumption  that  the  probability  of 
operand  delay  is  identically  and  independently  distributed. 
Because $S = A \vee B \vee C$, the probability of no operand delay given I, D,
and f-extensibility for dyadic microoperations is:

$$NOD(I+D,f) = 1 - P[S_1 \vee S_2] = \left\{ 1 - P[Mv] \sum_{j=1}^{I+D} o_j - K \cdot AF \sum_{j=J_0}^{J_1} o_j \right\}^2 \qquad (4.93)$$

where the results of equations 4.90 and 4.92 are substituted into
equation 4.93.
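
The monadic and dyadic forms of NOD(I+D,f) differ only in a final squaring, which a short sketch makes visible. The code below is illustrative only: the function names and all numeric inputs are hypothetical, the operand dependency distribution is passed as a list with o[j-1] holding $o_j$, entries beyond the end of the list are treated as zero, and a non-integer upper limit $J_1$ is rounded down (an assumption the text does not state).

```python
def operand_delay_probability(o, i_plus_d, p_mv, k, af, t_max, ier):
    """P[A v B v C | I,D] as in (4.90); o[j-1] holds o_j."""
    def window_sum(lo, hi):
        # Sum of o_j for lo <= j <= hi; empty when hi < lo.
        return sum(o[lo - 1:hi])

    j0 = i_plus_d + 1
    j1 = i_plus_d + int((t_max - 1) * ier)     # floor of J_1 (assumption)
    delay_bc = p_mv * window_sum(1, i_plus_d)  # issued or delayed producers
    delay_a = k * af * window_sum(j0, j1)      # executing producers
    return delay_bc + delay_a


def nod(o, i_plus_d, p_mv, k, af, t_max, ier, dyadic=False):
    """NOD(I+D,f): (4.91) for monadic, (4.93) for dyadic microoperations."""
    no_delay = 1.0 - operand_delay_probability(o, i_plus_d, p_mv, k, af, t_max, ier)
    return no_delay ** 2 if dyadic else no_delay


# Hypothetical inputs, chosen only to exercise the formulas.
o = [0.4, 0.3, 0.2, 0.1]
args = dict(i_plus_d=1, p_mv=0.5, k=0.5, af=0.8, t_max=3, ier=1.0)
print(nod(o, **args), nod(o, dyadic=True, **args))
```

With these inputs the delay probability is 0.4, so the monadic value is 0.6 and the dyadic value is its square.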

R(I,D,f)

Register  availability  must  be  guaranteed  if  a  microoperation  is 
to  be  issued.  This  event  is  described  by  the  state  of  the 
registers.  A  register  is  available  if  there  are  no  remaining 
references to its contents; otherwise it is occupied with a live
value  and  cannot  be  reassigned.  In  the  static  environment  that 
occurs  during  microprogram  generation,  register  availability  does 
not  depend  on  D  because  registers  are  assigned  to  microoperation 
results  on  a  demand  basis.  In  the  dynamic  environment  of  an 
adaptive  processor,  the  registers  assigned  to  delayed 
microoperations  must  be  considered  in  some  schemes.  The  analysis 
given  below  derives  the  probability  that  a  register  will  be 
available for an f-microoperation, R(I,D,f).

An exact analysis has many difficulties to overcome. Among
these are precise descriptions of the working set of microprogram
variables which characterizes register need, the register
allocation policy used to assign variables to registers, and the
strategy used to generate the microoperation stream from the
algorithm which describes the computation to be performed. We
illustrate with an example. Assume that in figure 4.12

1. each node corresponds to a microoperation,

2. each microoperation takes 1 cycle to execute,

3. the node number corresponds to the microoperation stream
order, and

4. there are sufficient resources to avoid conflict.

Graphs I and II represent identical computations, differing only
in microoperation ordering. For Issuer scan rates of 1 and 2
microoperations per cycle, table 4.2 shows the generated
microinstructions, along with the live register values.


Figure 4.12 Microcode Structures

Table 4.2 Live Register Values.

Observe that the two graphs induce different sets of
microinstructions and live values, showing that stream order does
affect resource requirements. Graham [GHA72] discusses such
effects in the context of processor scheduling from list
schedules.

From table 4.2, we can derive some useful parameters of the
microprogram execution that describe these effects. Let

P[Mv] = the probability a register value creating
        microoperation occurs

IER   = # of microoperations / # of microinstructions

E[l]  = the expected lifetime of a register value
      = # of live values / # of register value creating
        microoperations

Na    = the average number of occupied registers
      = # of live values / # of microinstructions

P[Mv] is invariant for different microoperation orderings or scan
rates. The remaining parameters are tabulated in table 4.3 for
the relevant scan rates.
The effect of stream order on the execution parameters is
demonstrated in figure 4.13. Structure II has been ordered to
maximize parallelism. This is demonstrated by the larger
register requirements and the sensitivity of IER to the scan
rate. On the other hand, structure I minimizes register usage by
minimizing parallelism. Its IER is much less sensitive to
scan rate increments, but it does finally converge on the maximum
IER at a scan rate of seven.


             Structure I              Structure II

  SR      IER     E[l]    Na       IER     E[l]    Na

   1      1       1.9     1.3      1       3       2.1
   2      1.25    1.6     1.4      1.67    1.6     1.8
   3      1.43    1.4     1.4      2.5     1.0     1.75
   4      1.67    1.3     1.5      2.5     1.0     1.75
   5      2.0     1.1     1.6      2.5     1.0     1.75
   6      2.0     1.1     1.6      2.5     1.0     1.75
   7      2.5     1       1.75     2.5     1.0     1.75

Table 4.3  Execution Parameters as a Function of Scan Rate
Structures I and II consist only of monadic microoperations.
These allow a much wider latitude in microoperation reordering,
in general, than a more realistic example having a majority of
dyadic microoperations. Consequently, parameter sensitivity
should be reduced for realistic microprograms. This provides the
basis for one of the approximations we use in this part of the
analysis. We will obtain the number of active registers, $N_a$, from

$$N_a = E[l]\,P[Mv] \qquad (4.94)$$

and assume that $N_a$ is independent of IER. Figure 4.13 indicates
that for the example used, this is a reasonable assumption.
Also, equation 4.94 can be written as

$$N_a = (E[l]/IER)\,(P[Mv]\,IER)$$

This form is recognized as Little's rule [KLE75], which gives the
average number of customers in a queue at equilibrium.

Register availability is determined from condition 4 of the
control policy described in 4.6.2. Condition 4(i) states that
the destination of an issuing microoperation cannot specify the
destination of a microoperation that is issued or that will
remain in execution. Condition 4(ii) prohibits the overwriting
of destinations containing live variables as well as those
covered by condition 4(i). From the total of reserved registers
specified above, the number of register values to die must be
subtracted. We will assume that the death rate of live
variables is equal to the number of register value creating
microoperations issued:

$$\text{death rate} = I \cdot P[Mv] \qquad (4.95)$$


Using Little's result to determine the number of executing
microoperations, the expected number of occupied or reserved
registers given that I microoperations have been issued is

$$E[R] = \Bigl(E[l] + IER + \sum_{f=2}^{F} E[Y_f]\Bigr)\,P[Mv] \qquad (4.96)$$


In equation 4.96, it is assumed that [...]. Register utilization,
$U_R$, is defined as $E[R]/N_R$ where $N_R$ is the number of
registers. $U_R$ can also be interpreted as the probability
that a register contains a live value. Assuming that
registers are randomly assigned to values and that register
occupancy is an independent event, the probability of
register occupancy can be described by the binomial density.
Thus if $N_R$ is the total number of registers and if R is the
random variable specifying the number of occupied registers,

$$P[R = k] = \binom{N_R}{k}\,U_R^{\,k}\,(1-U_R)^{N_R-k} \qquad (4.98)$$

From equation 4.98, the probability that no registers are
available is

$$P[R = N_R] = U_R^{N_R} \qquad (4.99)$$


A register dependency occurs if there are no registers available
and if the f-microoperation requires a register. If $p_{fv}$ is the
proportion of f-microoperations that create values, then

$$R(I,D,f) = 1 - p_{fv}\,U_R^{N_R} \qquad (4.100)$$
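
The register availability calculation can be illustrated numerically. The following is a minimal Python sketch, assuming our reading of equations 4.98 to 4.100 (binomial occupancy with per-register probability E[R]/N_R); the function and parameter names are hypothetical and are not taken from the original text.

```python
from math import comb


def occupancy_pmf(k, n_regs, util):
    """P[R = k]: binomial probability that exactly k of n_regs registers
    hold a live value, each independently with probability util."""
    return comb(n_regs, k) * util**k * (1.0 - util) ** (n_regs - k)


def register_availability(expected_occupied, n_regs, p_fv):
    """R(I,D,f) as read from equation 4.100.

    expected_occupied -- E[R], expected number of occupied registers
    n_regs            -- N_R, number of work registers
    p_fv              -- proportion of f-microoperations that create values
    """
    util = expected_occupied / n_regs                 # register utilization
    p_none_free = occupancy_pmf(n_regs, n_regs, util)  # equals util ** n_regs
    return 1.0 - p_fv * p_none_free


if __name__ == "__main__":
    # Illustrative values only: 16 registers, 8 expected occupied.
    print(register_availability(8.0, 16, 0.7))
```

With moderate utilization the probability that all registers are full is very small, so the availability stays close to 1 until E[R] approaches N_R.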
Termination Transition Probabilities

On a microprogrammable processor, the events that determine
when a scan terminates are the events that determine when a
microinstruction composition is completed. There are only two
such events:

E1. There are no active microoperations remaining in the
Window, or

E2. The current composition state is not f-extensible for all
f-microoperations.

Consequently, the scan will terminate at point (I,D) with
probability

$$PT[I,D] = P[E1 \lor E2 \mid I,D]$$

$$= 1 - P[\lnot E1 \land \lnot E2 \mid I,D]$$

$$= 1 - P[\lnot E1 \mid I,D]\,P[\lnot E2 \mid I,D] \qquad (4.101)$$

In equation 4.101, it is assumed that E1 and E2 are statistically
independent events. Using the analysis that leads to equation
4.69

$$P[\lnot E1 \mid I,D] = W(I+D) = (1-p_b)^{I+D+1} \qquad (4.102)$$

Also, from equation 4.84

$$P[\lnot E2 \mid I,D] = 1 - \prod_{f=2}^{F}\bigl(1 - b_f\,CX(I,f)\bigr) \qquad (4.103)$$

Combining equations 4.101, 4.102, and 4.103

$$PT[I,D] = 1 - (1-p_b)^{I+D+1}\Bigl(1 - \prod_{f=2}^{F}\bigl(1 - b_f\,CX(I,f)\bigr)\Bigr) \qquad (4.104)$$
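
As an illustration of how the termination transition probability combines its two components, here is a minimal Python sketch. It assumes the form of equation 4.104 as we read it: a geometric window term with exponent I+D+1 multiplied by the probability that at least one function unit class is extensible. The names are hypothetical, and the per-class extensibility values are taken as given inputs.

```python
from math import prod


def termination_probability(i, d, p, ext_by_class):
    """PT[I,D] = 1 - (1-p)^(I+D+1) * (1 - prod_f (1 - ext_f)).

    i, d         -- issued and delayed microoperation counts
    p            -- per-position probability that the Window runs out
    ext_by_class -- extensibility probability for each class f = 2..F
    """
    window_active = (1.0 - p) ** (i + d + 1)
    some_class_extensible = 1.0 - prod(1.0 - e for e in ext_by_class)
    return 1.0 - window_active * some_class_extensible


if __name__ == "__main__":
    # Illustrative values only.
    print(termination_probability(0, 0, 0.1, [0.5, 0.5]))
```

The scan is certain to terminate when no class is extensible, and termination becomes more likely as I+D grows because the window term decays geometrically.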
Horizontal Microprogrammable Control Units

These microprogrammable computers have a format set, Z,
that specifies the permissible combinations of microoperations
that can be initiated together. None of the $Z_i$ is maximal. The
horizontal microprogrammable computers have the same microcode
generation process and issue policy as the maximal
microprogrammable computers. Consequently, the analysis follows
the same reasoning used to derive the scan space transition
probabilities for the maximal machine. Only the composition
state probabilities will differ.

Let $Z = \{Z_i\}$, $1 \le i \le r$

where $Z_i = (z_{i1}, z_{i2}, \ldots, z_{if}, \ldots, z_{iF})$

$z_{if}$ specifies the maximum number of f-control
fields allowed by format $Z_i$ ($z_{if} \le n_f$)

F = the number of function unit types

Two formats $Z_i$, $Z_j$ are independent if for all f
(i) $z_{if} \ne 0$ implies $z_{jf} = 0$ and
(ii) $z_{jf} \ne 0$ implies $z_{if} = 0$

The set of permissible microinstructions, MI, is defined by
$MI = \{X \mid x_f \le z_{if} \text{ for all } f, \text{ for some } Z_i \in Z\}$
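
The format definitions above can be made concrete with a short Python sketch. It assumes that a format and a composition state are both represented as tuples of per-class field counts; this representation and the function names are illustrative assumptions, not part of the original model.

```python
def formats_independent(z_a, z_b):
    """Two formats are independent if no function unit class has a
    nonzero field count in both of them."""
    return all(a == 0 or b == 0 for a, b in zip(z_a, z_b))


def is_permissible(x, format_set):
    """A composition state x is a permissible microinstruction if it fits
    within the field counts of at least one format in the set."""
    return any(
        all(x_f <= z_f for x_f, z_f in zip(x, z)) for z in format_set
    )


if __name__ == "__main__":
    # Hypothetical format set over three function unit classes.
    formats = [(2, 1, 0), (0, 0, 3)]
    print(formats_independent(*formats))       # the two formats share no class
    print(is_permissible((1, 1, 0), formats))  # fits the first format
    print(is_permissible((1, 0, 1), formats))  # fits neither format
```

A state that mixes classes from two independent formats, such as the last example, is rejected even though each class count is individually within some format.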

By replacing the definition of MI given by equation 4.75 with the
generalized definition above, equations 4.75-4.79 may be used to
compute CX(I,f).

If for all $Z_i, Z_j \in Z$, $Z_i$ and $Z_j$ are independent, the
computation of CX(I,f) proceeds as before, unless the binomial
simplification leading to equation 4.80 is used. In this case,
the counterpart of equation 4.80 would be conditioned by the
occurrence of the appropriate format.
For a microinstruction format $Z_i$, let $p_{Z_i}$ be the probability
that a microoperation belonging to $Z_i$ is selected. Then

$$p_{Z_i} = \sum_{f=2}^{F} p_f\,\delta(i,f) \qquad (4.105)$$

where $\delta(i,f) = 1$ if $z_{if} > 0$, and $0$ otherwise.

Let $z_{if} > 0$ and $X_f = k \le z_{if}$.

Then from the binomial approximation,

$$P[X_f = k \mid I, Z_i] = [\ldots] \qquad (4.106)$$

Using equation 4.106, one may obtain the binomial approximation
using equation 4.81.


If the format specifications are not independent, complexity
is introduced into the binomial simplification. In this case,
the computation using the multinomial form for CX(I,f) is more
direct. One would obtain contributions to CX(I,f) from $Z_1$ as in
the maximal format case. The contributions from all possible
combinations of the other formats would be computed. However,
these contributions would not be added if the combination of
microoperations was generated during the processing of one of the
previous formats.

It should be clear that the number of permissible
microinstructions on horizontal microprogrammable computers is
less than the number on an operationally equivalent maximal
machine. This is a consequence of the reduced microinstruction
format degree. The performance ratio will depend on the set, Z.
If the $Z_i$ are selected to include the statistically most popular
combinations for some program structure, one can obtain sizable
control memory reduction, yet maintain a reasonable performance
ratio.

4.8 Scan Space Models of Adaptive Microprogrammable Computers

This section provides derivations of scan space models for
adaptive processors whose control policies were described in
chapter 3. As much of the groundwork has already been developed
in the previous sections, the derivations will be brief. Detail
will be given for new features that are included for a given
adaptive processor.

For adaptive processors, the scan space extent is limited by
the scan rate of the Issuer, so that the maximum size of I+D
equals the scan rate, SR. Furthermore, to include the effect of
the program structure on parallelism detection by the Issuer,
maximum I is reduced to

$$I_{max} = SR/I_a \qquad (4.107)$$

where SR = scan rate

$I_a$ = average issue interval (equation 4.5)

The composition states for adaptive processors will be
limited in degree to $I_{max}$. However, the maximal format
specification is used to specify the permissible microoperation
combinations.
4.8.1 The Microinstruction Register Adaptive Processor

This adaptive processor has essentially the same issue policy
as the maximal microprogrammable computers except for the
restrictions stated above. The issue policy is given in section
3.6.1. In the following parts of this section, the issue policy
is used to formulate probabilistic events that induce scan state
transitions.


Issue Transition Probabilities

For an f-microoperation to be issued when the scan process is
in state (I,D,0), all of the following events must occur:

1. W(I+D): {window position I+D must contain an active
microoperation}

2. $p_f$: {the issuing microoperation is an f-microoperation | 1}

3. EXT(I,f): {the composition state and computation state
must be f-extensible | 1,2}

4. NOD(I+D,f): {there are no delaying operand dependencies from
Window position I+D | 1,2,3}

5. R(I,D,f): {there will be a destination register available |
1,2,3,4}

6. SR(I+D): {I+D < issuer scan rate}

7. IM(I): {I < $I_{max}$}

There is a direct correspondence between events 1-5 above and
those for the maximal microprogrammable computers. Events 6 and
7 serve to limit the scan space extent. Thus


$$PI[I,D] = \begin{cases} W(I+D)\displaystyle\sum_{f=2}^{F} p_f\,EXT(I,f)\,NOD(I+D,f)\,R(I,D,f) & \text{for } I < I_{max} \text{ and } I+D < SR \\ 0 & \text{otherwise} \end{cases} \qquad (4.108)$$
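
The structure of the issue transition probability can be sketched in Python as follows. This is a minimal illustration assuming our reading of equation 4.108, with the window term and the per-class components supplied as precomputed numbers; the function name and argument layout are hypothetical.

```python
def issue_probability(i, d, scan_rate, i_max, window, classes):
    """PI[I,D] for the microinstruction register adaptive processor.

    window  -- W(I+D), probability that position I+D holds an active
               microoperation
    classes -- one tuple (p_f, ext, nod, reg) per class f = 2..F, where
               the entries stand for p_f, EXT(I,f), NOD(I+D,f), R(I,D,f)
    """
    # Events 6 and 7: outside the scan space extent nothing can issue.
    if not (i < i_max and i + d < scan_rate):
        return 0.0
    return window * sum(p_f * ext * nod * reg for p_f, ext, nod, reg in classes)


if __name__ == "__main__":
    # Illustrative values only: two function unit classes.
    classes = [(0.6, 0.9, 0.8, 1.0), (0.4, 0.5, 0.5, 0.9)]
    print(issue_probability(1, 1, scan_rate=4, i_max=3, window=0.81, classes=classes))
```

Each class contributes in proportion to its incidence in the stream, so an underutilized class with high extensibility and few dependencies dominates the sum.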
The W(I+D) for the adaptive processor model is based on equation 4.55.

$$W(I+D) = (1-p)^{I+D+1} \qquad (4.109)$$

where $p = (E[W_o] + E[C_r])^{-1}$

$E[C_r]$ and $E[W_o]$ can be obtained using equations 4.28 and 4.53
respectively.

Register availability differs for the adaptive processor
because the destination availability check must include
destinations of delayed microoperations. Their destinations have
already been assigned and they increase the register occupancy.

The expected number of occupied registers, E[R], must include
this occupancy. From equation 4.96,

$$E[R] = \Bigl(E[l] + IER + \sum_{f=2}^{F} E[Y_f] + D\Bigr)\,P[Mv] \qquad (4.110)$$

For register value creating microoperations, this leads to

$$R(I,D,f) = 1 - U_R^{N_R}$$

where $U_R = E[R]/N_R$

Termination Transition Probabilities

For the microinstruction register adaptive processor, the
events that terminate the scan process in state (I,D,0) are

E1. There are no active microoperations remaining in
the Window, or

E2. The current composition state and computation state
are not f-extensible for all f, or

E3. $I \ge I_{max}$, or

E4. $I+D \ge SR$.

Events E1 and E2 correspond to E1 and E2 of the previous section.
E3 and E4 serve as the absolute scan termination events due to
the hardware limitations. The termination transition probability
has the same form as equation 4.104,

$$PT[I,D] = \begin{cases} 1 - (1-p)^{I+D+1}\Bigl(1 - \displaystyle\prod_{f=2}^{F}\bigl(1 - b_f\,CX(I,f)\bigr)\Bigr) & \text{for } I+D < SR \text{ and } I < I_{max} \\ 1 & \text{otherwise} \end{cases} \qquad (4.112)$$

where $p = (E[W_o] + E[C_r])^{-1}$


4.8.2  Source  Buffered  Adaptive  Processor 

From  the  issue  policy  of  this  adaptive  processor  (section 
3.6.2)  the  differences  from  the  microinstruction  register 
adaptive  processor  are  that  the  wait  for  a  function  unit  is 
replaced  by  a  wait  for  a  control  field  and  the  destination  wait 
event  is  reduced. 

The wait for the control field will modify EXT(I,f). In
essence, each control field can now buffer a microoperation until
the function unit becomes available. To derive EXT(I,f), we must
determine the probability distribution of the random variable
giving the number of occupied control fields at the beginning of
a cycle. First we determine the probability that a given control
field will be occupied:

P[control field i is occupied]

= P[function unit i remains busy and
an f-microoperation is issued to control field i]

If we assume random assignments to available control fields
(better strategies are easily implementable),
P[control field i is occupied]

= P[function unit i remains busy] · P[f-microoperation is issued to
control field i]

$$= \frac{E[Y_f]}{n_f}\cdot\frac{P[M_f(f)]\,IER}{n_f} \qquad (4.113)$$

The first term is given by equation 4.73. The second term is the
probability that an f-microoperation arrives and is assigned to one of
$n_f$ control fields. Assuming that occupancy of a control field is
independent of occupancy of other control fields,

$$[\ldots] \qquad (4.114)$$

where $p = (EET(f) - 1)\,(P[M_f(f)]\,IER/n_f)$

as obtained from equations 4.72 and 4.113.

The remaining developments to obtain EXT(I,f) follow equations
4.75-4.83 with the number of occupied control fields substituting
for the number of busy function units. Comparing equation 4.73
with 4.113, it can be seen that the dependency of f-extensibility
on the computation state is greatly reduced, giving

$$EXT(I,f) = CX(I,f) \qquad (4.115)$$

Equation 4.78 or 4.81 could be used as an approximation to
EXT(I,f).

The destination wait event is reduced because delayed
microoperations buffer their sources if they are available. This
reduces the number of remaining references to a register value,
permitting, on average, a faster reuse of registers for result
deposition. The mean number of unavailable registers, E[R], is
obtained from:

$$E[R] = \Bigl(E[l] + IER + \sum_{f=2}^{F} E[Y_f] + D\,(1-k)\Bigr)\,P[Mv] \qquad (4.116)$$
To obtain k, we assume that the number of deaths caused by D
delayed microoperations is proportional to the number of values
accessible to a delayed microoperation divided by the total
number of assigned registers. The values accessible to a delayed
microoperation are the currently live values existing in a
register and the IER microoperation values that will be available
in the next cycle. Thus,

$$k = \frac{E[l] + IER}{E[l] + D + IER + \sum_{f=2}^{F} E[Y_f]} \qquad (4.117)$$

For register value creating microoperations, this leads to

$$R(I,D,f) = 1 - U_R^{N_R} \qquad (4.118)$$

where $U_R = E[R]/N_R$
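
The three register occupancy estimates differ only in how delayed microoperations are counted. The following Python sketch compares them under our reading of equations 4.96, 4.110, 4.116 and 4.117; the reconstruction of the death fraction k in particular is an assumption, and all names are hypothetical.

```python
def expected_occupied(lifetime, ier, busy_units, p_mv,
                      delayed=0.0, source_buffered=False):
    """Expected number of occupied or reserved registers, E[R].

    lifetime        -- E[l], expected lifetime of a register value
    ier             -- instruction execution rate
    busy_units      -- E[Y_f] for each function unit class f
    p_mv            -- P[Mv], probability of a value creating microoperation
    delayed         -- D, number of delayed microoperations (0 for eq. 4.96)
    source_buffered -- apply the (1 - k) reduction of eq. 4.116
    """
    busy = sum(busy_units)
    delayed_term = delayed
    if source_buffered:
        # k: share of assigned registers whose values a delayed
        # microoperation can capture in its source buffers (eq. 4.117).
        k = (lifetime + ier) / (lifetime + delayed + ier + busy)
        delayed_term = delayed * (1.0 - k)
    return (lifetime + ier + busy + delayed_term) * p_mv


if __name__ == "__main__":
    args = dict(lifetime=1.9, ier=1.0, busy_units=[0.5, 0.5], p_mv=0.7)
    print(expected_occupied(**args))                                     # no delays
    print(expected_occupied(**args, delayed=2.0))                        # delays counted
    print(expected_occupied(**args, delayed=2.0, source_buffered=True))  # source buffered
```

Source buffering always yields an estimate between the other two, since only the fraction 1 - k of the delayed microoperations still holds a register reference.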

Termination Transition Probabilities

The events of scan termination for the source buffered
adaptive processor correspond to those for the microinstruction
register adaptive processor. Only E2 has a different
mathematical description. PT[I,D] has the same form as equation
4.112.

$$PT[I,D] = \begin{cases} 0 & I = D = 0 \\ 1 - (1-p)^{I+D+1}\Bigl(1 - \displaystyle\prod_{f=2}^{F}\bigl(1 - EXT(I,f)\bigr)\Bigr) & 1 \le I+D < SR \text{ and } I < I_{max} \\ 1 & \text{otherwise} \end{cases} \qquad (4.119)$$

where $p = (E[W_o] + E[C_r])^{-1}$

and EXT(I,f) is obtained from equation 4.115
4.8.3 Source and Result Buffered Adaptive Processors

There are many variations of result buffering possible. One
could supply one or many buffers at the function unit outputs or
associate the buffers with work register inputs. In this section,
we consider the example described in section 3.6.3. Result
buffering permits, under certain conditions, microoperation issuing
despite the specified destination register not being available.
Consequently, register availability, R(I,D,f), will be increased
over the source buffering only case.

The issue subpolicy described in section 3.6.3 permits
issuing, with respect to destination register assignment, unless
either the destination register is assigned to a delayed
microoperation or the register's Inhibit bit is on. To obtain
R(I,D,f), some register assignment policy must be assumed.
Random assignment gives a simple, but far from optimal allocation
policy. It will be used to determine a lower bound.

Let

m be the microoperation currently being examined by
the Issuer,

r be the destination of m,

A be the event that the Inhibit bit of r is off,

B be the event that r is not the destination of a delayed
microoperation.

Then

$$R(I,D,f) = P[A \land B] \qquad (4.120)$$

Event $A \land B$ can be viewed as a combinatorial problem. The
executing microoperations, the I issued microoperations and D
delayed microoperations define a string of destination register
specifications. We will assume the string length, L, is given by

$$L = \Bigl\lceil \Bigl(IER + \sum_{f=2}^{F} E[Y_f] + I + D\Bigr)\,P[Mv] \Bigr\rceil \qquad (4.121)$$

The string characters may be classified into three types
according to the microoperation disposition and destination
specification

i - issued or executing, specifying register r
d - delayed, specifying register r
x - other.

For simplicity, we will assume the probability of issue equals
the probability of delay, and that occurrences of i, d, and x are
statistically independent. Then

$$P[i] = P[d] = 0.5/N_R$$

$$P[x] = (N_R - 1)/N_R \qquad (4.122)$$

From the problem description, the strings that permit the issuing
of a microoperation m specifying r as the destination register
are composed of L x's, or a combination of L-1 x's and one i.
Thus

$$R(I,D,f) = \left[\frac{N_R - 1}{N_R}\right]^{L} + L\left[\frac{N_R - 1}{N_R}\right]^{L-1}\frac{0.5}{N_R} \qquad (4.123)$$
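
The string counting argument can be checked numerically. The sketch below is a minimal Python illustration assuming our reading of equation 4.123, in which the single i character may occupy any of the L string positions; the function name and the sample values are hypothetical.

```python
def result_buffered_availability(n_regs, length):
    """R(I,D,f) under random register assignment with result buffering.

    n_regs -- N_R, number of work registers (must exceed 1)
    length -- L, length of the destination register string
    """
    p_x = (n_regs - 1) / n_regs   # character does not name register r
    p_i = 0.5 / n_regs            # issued or executing, names register r
    all_other = p_x ** length
    one_issued = length * p_x ** (length - 1) * p_i
    return all_other + one_issued


if __name__ == "__main__":
    # Illustrative values only: availability falls as the string grows.
    for length in (0, 2, 8, 32):
        print(length, result_buffered_availability(16, length))
```

The second term is the gain from result buffering: without it, only strings made entirely of x characters would permit issue.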

The remaining facets of the source and result buffered
adaptive processor are the same as for the source buffered only
case. EXT(I,f) would correspond to equation 4.115 and the
termination transition probability, PT[I,D], would correspond to
equation 4.119.
4.8.4 Virtual Function Unit Adaptive Processor

This adaptive processor is described in section 3.6.4. In
this organization, the Issuer may be viewed as a selector that
diverts f-microoperations to their respective function unit
group. The Control Buffers now have the responsibility to
determine when a microoperation is to initiate. With operand
forwarding, the initiation condition is now greatly reduced.

The virtual function unit adaptive processor can realize
performance speed-up over the previously discussed organizations
from two advantages. The first, reduced initiation restrictions,
results in greater parallelism in the computation [KEL75]. The
second is a multiplication of the effective scan rate. This
phenomenon is due to the channelling of microoperations to
independent sets of virtual function units, each of which can
independently transmit execution ready microoperations to
function units. Because of the limited sequential scan rate of
an Issuer, such capability may be needed to exploit the
parallelism of microprograms.

In determining the performance of this organization, the
following considerations will apply.

1. The scan space will model a logical, aggregate Issuer
whose subparts consist of the sets of f-virtual function
units. The scan rate of this Issuer is

$$SR = \sum_{f=2}^{F}\sum_{i=1}^{s_f} v_{fi} \qquad (4.124)$$

where $v_{fi}$ is the number of f-virtual function
units in set i

$s_f$ is the number of independent sets of
f-virtual function units

Each independent set of virtual function units has an
exclusive set of $n_{fi}$ f-function units that execute the
transmitted microoperations.

2. The physical Issuer acts as an f-microoperation
demultiplexor that directs microoperations to their
respective class of virtual function units. We will
assume that it is not a bottleneck.

3. The aggregate Issuer issues according to the GENERATE
subpolicy stated in section 3.6.4. Thus a microoperation
is issued when it is passed from a virtual function unit
to a function unit. A microoperation is delayed when it
resides in a virtual function unit but not all of its sources
are available.

The Window function, W(I+D), as for the other adaptive
processor models, is based on equation 4.12. This is a
consequence of consideration 2, as the physical Issuer is assumed
not to impede microoperation flow to the virtual function units.
The virtual function units in this case comprise the logical
Window to the aggregate Issuer.


The conditions of f-extensibility are the same as those for
the Microinstruction Register adaptive processor. An
f-microoperation must wait for an idle f-function unit and cannot
be issued if all f-function units are occupied. Thus the EXT(I,f)
derived for the Maximal machine (equation 4.84) may be used.


Operand dependency cannot be reduced by the virtual function
unit adaptive processor. NOD(I+D,f) is the same as that derived
for the maximal processor.

R(I,D,f), however, becomes uniformly 1 because the assignment
of destination registers cannot induce hazards on an operand
forwarding scheme.
4.9 Equation Summaries and IER Computation Procedure

This section summarizes the equations developed for the
various microprogrammable control units. A glossary of the
symbols is given below. This is followed by an outline of an
algorithm that computes the IER. Finally, tables that summarize
the equations for the derived models and a discussion of the
major components in the equations are given.

IER        = instruction execution rate, including branch waits
IER*       = instruction execution rate, not including branch wait
p_b        = probability of a branch microoperation
           = p_u + p_c
p_u        = probability of an unconditional branch
p_c        = probability of a conditional branch
p_fv       = proportion of f-microoperations that create register
             values
p_f1       = proportion of f-microoperations that are monadic
p_f2       = proportion of f-microoperations that are dyadic
Mf(f)      = set of f-microoperations
Mft(t)     = set of f-microoperations with execution time t
Mv         = set of value creating microoperations
EX(m,f)    = execution time of microoperation m on an f-function
             unit
T_m        = maximum execution time of a microoperation
E[l]       = expected lifetime of a value
[illegible] = operand dependency distribution for monadic
             and dyadic microoperations respectively
n_f        = number of f-function units
N_R        = number of work registers
Z_i        = a microinstruction format
I_a        = average issue interval
SR         = Issuer scan rate
E[B_c]     = expected branch wait for a conventional
             processor
E[B_a]     = expected branch wait for an Adaptive
             processor
E[T_c]     = expected length of computation burst on a
             conventional processor
E[T_a]     = expected length of computation burst on an
             Adaptive processor
W_o        = number of microoperations in Window after predicate
             evaluation
C_r        = number of microoperations in block tail
S          = Control Memory Supply Rate
Mtc(t)     = set of CC microoperations whose execution time
             is t
v_fi       = number of f-virtual function units in set i
I          = number of issued microoperations
D          = number of delayed microoperations
W(I+D)     = window probability density function
EXT(I,f)   = probability that the composition and
             computation states are f-extensible
NOD(I+D,f) = probability that an f-microoperation
             will not have an operand delay
R(I,D,f)   = probability that a destination register will
             be available
DEG(Z_i)   = number of microoperation fields in the
             microinstruction format
Y_f        = number of f-function units that will remain
             busy in the next cycle
EET(f)     = weighted execution time of f-microoperations
             according to stream mix and f-function unit
             capability
CX(I,f)    = probability the composition state is f-extensible
             given I issued microoperations
E[tv]      = expected execution time of a value creating
             microoperation
AF         = reduction of operand dependency by executing
             microoperations due to conditioning by
             f-extensibility of computation state
U_R        = register utilization
PI[I,D]    = issue transition probability
PT[I,D]    = termination transition probability
PD[I,D]    = delay transition probability
[illegible] = probability a block passes through Converger
             level k
[illegible] = probability block first enters Converger level
             number k
[illegible] = primary supply ratio
[illegible] = probability a block is not entirely prefetched in
             Converger level number k
[illegible] = mean arrival rate of blocks into Active node
[illegible] = mean arrival rate of blocks into a node at
             Converger level k
E          = number of microoperations in execution
L          = mean length of a register allocation string
An outline of an algorithm to compute the IER is given below.
It is an iterative procedure whose loop body consists of two
major steps, step 3 and step 4. Step 3 computes the transition
probabilities out of each non-terminal state in the extent of the
scan space. Step 4 computes the terminal state probabilities
and a refined estimate for IER*. The final step adjusts the IER
to include the execution of branches.

/* Algorithm to Compute IER */

1. Initialize data

   Enter machine configuration, program structure
   Generate microoperation - function unit table
   Get initial estimate for IER, IER_0;
   k <- 1;   /* Iteration number */

2. Iteration # k

   /* Calculate expected Branch Wait time and Computation times */
   E[B_c] (or E[B_a]) <- equations from
   E[T_c] (or E[T_a]) <- Summaries

   /* Compute IER */
   IER_k <- E[T_c]/(E[T_c] + E[B_c]);

   /* Check for convergence of IER */
   IF (|IER_k - IER_(k-1)| < 10^(-?)) GO TO FINISHED

3. Compute state transition probabilities

   /* Loop over Issue dimension */
   /* Imax = MIN{SR/(1 + I_a), MAX{DEG(Z_i)}} */
   DO for i=0 to Imax;

      /* Loop over Delay dimension */
      /* limit for d determined by maximum scan rate */
      DO for d=0 to SR-i;

         PI[i,d] <- corresponding equations
         PT[i,d] <- from
         PD[i,d] <- summary

         /* determine maximum extent of d for given i */
         DMAX(i) <- d;
         IF (PI[i,d] < 10^(-?)) GO TO ENDi;

      END;
   ENDi: END;

4. Compute State Probabilities, IER

   /* Loop is simplified by initializing state probabilities */
   /* outside scan space - P(-1,d,0) = P(i,-1,0) = 0 */
   /* This is done in the initialization step */

   P(0,0,0) <- 1;
   IERSUM <- 0;
   NORMSUM <- PT(0,0);

   /* Compute probabilities using a recurrence based on
      equation 4.71 */
   DO for i=0 to Imax;
      DO for d=0 to DMAX(i);
         IF (i=0 & d=0) GO TO NEXT;
         P(i,d,0) <- P(i-1,d,0)PI(i-1,d) + P(i,d-1,0)PD(i,d-1);
         P(i,d,1) <- P(i,d,0)PT(i,d);
         NORMSUM <- NORMSUM + P(i,d,1);
         IERSUM <- IERSUM + i * P(i,d,1);
      NEXT: END;
   END;

   IER* <- IERSUM/NORMSUM;
   k <- k+1;
   GO TO STEP 2;

5. /* Include executed branches into the final IER */

   FINISHED: IER <- IER/(1 - p_u - p_c);
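
Step 4 of the outline is a forward recurrence over the scan space. A minimal Python sketch of that step is given below, assuming the transition probabilities are supplied as functions of (i, d); the function name, the dictionary representation of the state probabilities, and the sample probabilities are illustrative assumptions only.

```python
def ier_star(pi, pt, pd, i_max, dmax):
    """Refined IER* estimate from one pass over the scan space.

    pi, pt, pd -- callables (i, d) -> issue, termination and delay
                  transition probabilities
    i_max      -- largest value of the issue dimension
    dmax       -- dmax[i] is the delay extent for issue count i
    """
    state = {(0, 0): 1.0}      # P(i,d,0); states outside the space count as 0
    ier_sum = 0.0
    norm_sum = pt(0, 0)        # terminal probability of the start state
    for i in range(i_max + 1):
        for d in range(dmax[i] + 1):
            if i == 0 and d == 0:
                continue
            from_issue = state.get((i - 1, d), 0.0) * pi(i - 1, d) if i > 0 else 0.0
            from_delay = state.get((i, d - 1), 0.0) * pd(i, d - 1) if d > 0 else 0.0
            state[(i, d)] = from_issue + from_delay
            terminal = state[(i, d)] * pt(i, d)   # P(i,d,1)
            norm_sum += terminal
            ier_sum += i * terminal
    return ier_sum / norm_sum


if __name__ == "__main__":
    # Toy scan space with no delays: issue with probability 0.5 until i = 2.
    estimate = ier_star(
        pi=lambda i, d: 0.5,
        pt=lambda i, d: 1.0 if i >= 2 else 0.5,
        pd=lambda i, d: 0.0,
        i_max=2,
        dmax=[0, 0, 0],
    )
    print(estimate)
```

The estimate is the expected number of issued microoperations at termination, normalized by the total terminal probability reached within the truncated scan space.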
The models will be discussed by examining the derived state
transition probability equations. An analysis of the sensitivity
of the IER to the model input parameters would be very complex
because the presented solution is not in closed form and because
of the intricate interrelationships between the IER and the
parameters. Instead, we will examine the sensitivities of the
components of the transition probabilities to the parameters. As
before, the most detailed analysis will be given for horizontal
microprogrammable processors. For the other classes, only the
essential differences will be discussed. The discussions are keyed
to the equations appearing in the equation summaries for the
classes. As the effect of E[ ] and E[ ] (or E[B] and E[T])
has already been described in section 4.4.2, the discussion will
concentrate on sensitivities of the scan space models.

Several general observations on the model equations and the
computation process can be made. Parameter changes that tend to
increase PI[I,D] will tend to increase the IER because the
weights to the scan space state probabilities will increase.
Also, changes that increase the scan space I-extent will increase
the IER because states with higher I are included. Termination
transition probabilities depend on two major components, W(I+D)
and EXT(I,f). Increasing the associated probabilities of these
components decreases the probability of a termination transition.
Consequently, the sum PI[I,D] + PD[I,D] increases, and parameter
changes that increase EXT(I,f) and W(I+D) will tend to increase
the IER.
Horizontal Microprogrammable Control Units

PI[I,D] consists of six functional components. Five are
grouped together and represent the effects of a given f-class.
These effects are weighted by the occurrence incidence, p_f, and
are added together to obtain the composite effect of the
operation unit resources on the program structure. The dominant
component of PI[I,D] is W(I+D). W(I+D) places a geometric
envelope on the issue transition probabilities over the scan
space. W(I+D) is affected by the incidence of branches.

The main effect of p_f is to weight the contribution of the
f-resources on performance, but it also influences EXT(I,f) and
NOD(I+D,f). If the f-resources are relatively underutilized,
components EXT(I,f) and NOD(I+D,f) will be high. This will
increase the relative weight of the f-component in computing
PI[I,D]. The relationship may be seen by taking the partial
derivative of PI[I,D] with respect to p_f. The relationship is
fairly complex, but it could be used to analytically determine the
parameters of operation unit f-components that maximize the
contribution to PI[I,D].

EXT(I,f), f-extensibility, is a measure of the remaining
capacity, or relative underutilization, of the f-resources. As
such, EXT(I,f) is increased by speeding up f-function units,
increasing n_f, increasing the number of control fields, and
decreasing p_f. Speeding up f-function units has the greatest
sensitivity. Reducing EET(f) to 1 reduces E[Y_f] to 0. A similar
effect can be obtained by increasing n_f. However, speed-up also
increases NOD(I+D,f) by reducing operand waits. Also, for large I,

PI[I,D] = W(I+D) \sum_{f=2}^{F} p_f EXT(I,f) NOD(I+D,f) R(I,D,f)
              if I < MAX{DEG(Z)}
        = 0   otherwise

EXT(I,f) = (1 - E[Y_f]/n_f) CX(I,f)

E[Y_f] = (EET(f) - 1) IER P[Mf(f)]

EET(f) = \sum_{t=1}^{T_m} t P[Mft(t)] / P[Mf(f)]

J_0 = I + D + 1

K = (E[tv] - 1)/(T_m - 1)

E[tv] = \sum_{t=1}^{T_m} t P[Mtv(t)]

AF = (1 - p_f E[Y_f]/n_f) / (1 - E[Y_f]/n_f)

PD[I,D] = 1 - PI[I,D] - PT[I,D]

[The right-hand sides of W(I+D), p_f, CX(I,f), NOD(I+D,f), J_1,
R(I,D,f), PT[I,D], E[B] and E[T] are not legible in the source.]

Table 4.3  Equation Summary for Horizontal Microprogrammable
Control Units
increasing the number of control fields increases CX(I,f) and
EXT(I,f).
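
The sensitivity of f-extensibility to function unit speed and
replication can be checked numerically. The sketch below assumes the
reading EXT(I,f) = (1 - E[Y_f]/n_f) CX(I,f) with
E[Y_f] = (EET(f) - 1) IER P[Mf(f)] from the equation summary; the
function and parameter names are hypothetical, and CX(I,f) is simply
passed in as a number because its definition is not reproduced here.

```python
def expected_busy_units(eet_f: float, ier: float, p_mf: float) -> float:
    """E[Y_f] = (EET(f) - 1) * IER * P[Mf(f)] (assumed reading of the summary)."""
    return (eet_f - 1.0) * ier * p_mf


def extensibility(eet_f: float, ier: float, p_mf: float, n_f: int, cx: float) -> float:
    """EXT(I,f) = (1 - E[Y_f]/n_f) * CX(I,f); cx stands for CX(I,f)."""
    return (1.0 - expected_busy_units(eet_f, ier, p_mf) / n_f) * cx
```

For example, with EET(f) = 3, IER = 1, P[Mf(f)] = 0.25 and CX(I,f) = 1,
a single f-unit gives EXT = 0.5 and two f-units give EXT = 0.75, while
reducing EET(f) to 1 drives E[Y_f] to 0 and EXT to CX(I,f) regardless
of n_f.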

NOD(I+D,f) is most sensitive to the program structure. In
particular, the amount of parallelism in a program will determine
the size of NOD(I+D,f). Greater program parallelism will provide
an operand referencing distribution with a greater expected
distance between a result generation and a successor source
reference. (Poor stream ordering, however, can thwart program
parallelism (section 4.1.5).) NOD(I+D,f) depends on the
proportion of monadic and dyadic operators. The contribution to
NOD(I+D,f) from dyadic operators is reduced, as it is basically
the square of the contribution by the corresponding monadic
operators. Within a dyadic or monadic component in NOD(I+D,f),
there are two summations. One derives an operand dependency due
to issued or delayed microoperations, the other due to
executing microoperations. The former summation is a direct
consequence of multiple issues per cycle. In general, the
multiplier P[Mv] will be greater than the multiplier K.AF that
weights the contributions of executing microoperations. This
indicates that succeeding issues in a multiple issue cycle become
more difficult for two reasons. The first is due to parallelism
limits. The second is due to an unavoidable operand delay: even
if the operation requires the minimum execution time, an
immediate successor must wait at least one cycle before it can
initiate. NOD(I+D,f) thus quantitatively expresses the feelings,
intuitions or experiences of several authors about the relative
utility of multiple scans per cycle [FLY66, AST67, KEL75]. The
geometric reduction of W with I+D also reflects on this aspect.

R(I,D,f) describes the availability of registers for results.
It is sensitive to the expected lifetime of values, decreasing
with increasing E[l]. However, this sensitivity is reduced by
providing more registers and increasing N. In sections 2.3.2
and 4.7, interrelationships between program parallelism and
register requirements have been described. R(I,D,f) along with
NOD(I+D,f) permits the quantification of this relationship.
Adaptive Processor


There are several differences exhibited by this class of
processor from the horizontal microprogrammable processor class.
The scan rate is now limited and issue will be restricted to
small I, particularly if the program structure is characterized
by a large average issue interval. The scan space extent
will be reduced, decreasing IER.

The Window distribution, however, is expected to reflect the
improved microoperation availability because of the longer
effective block length.

Register utilization is now sensitized to delayed
microoperations. Consequently, R(I,D,f) reduces PI[I,D] and
increases PD[I,D]. A large N reduces this sensitivity.


PI[I,D] = W(I+D) \sum_{f=2}^{F} p_f EXT(I,f) NOD(I+D,f) R(I,D,f)
              if I < SR/(average issue interval) and I+D < SR
        = 0   otherwise

W(I+D) = [1 - (E[W_0] + E[C_r])^{-1}]^{I+D}

p = (E[W_0] + E[C_r])^{-1}

EXT(I,f) = (1 - E[Y_f]/n_f) CX(I,f)

E[Y_f] = (EET(f) - 1) IER P[Mf(f)]

EET(f) = \sum_{t=1}^{T_m} t P[Mft(t)] / P[Mf(f)]

J_0 = I + D + 1

K = (E[tv] - 1)/(T_m - 1)

E[tv] = \sum_{t=1}^{T_m} t P[Mtv(t)]

AF = (1 - p_f E[Y_f]/n_f) / (1 - E[Y_f]/n_f)

PD[I,D] = 1 - PI[I,D] - PT[I,D]

[The right-hand sides of E[W_0], E[C_r], s, p_f, CX(I,f),
NOD(I+D,f), J_1, R(I,D,f), PT[I,D], E[B] and E[T] are not legible
in the source.]

Table 4.4  Equation Summary for Microinstruction Register
Adaptive Processor
Source Buffered Adaptive Processor


The additional effect of source buffering is to improve
R(I,D,f) and EXT(I,f) over the unbuffered case. A portion of the
delayed microoperations will buffer available source operands and
free up additional registers for result deposit. The
significance of this improvement to R(I,D,f) will depend on E[l],
IER, and N.

Extensibility improves because issuing has been partially
decoupled from the computation state. For a particular
f-resource, this change would depend on its previous utilization
and on the replication number.
PI[I,D] = W(I+D) \sum_{f=2}^{F} p_f EXT(I,f) NOD(I+D,f) R(I,D,f)
              if I < SR/(average issue interval) and I+D < SR
        = 0   otherwise

W(I+D) = [1 - (E[W_0] + E[C_r])^{-1}]^{I+D}

EXT(I,f) = (1 - E[Y_f]/n_f [operator illegible] P[Mf(f)] IER/n_f) CX(I,f)

E[Y_f] = (EET(f) - 1) IER P[Mf(f)]

EET(f) = \sum_{t=1}^{T_m} t P[Mft(t)] / P[Mf(f)]

[The right-hand sides of E[W_0], E[C_r], s and p_f are not legible
in the source.]

Table 4.5  Equation Summary for Source Buffered Adaptive Processor
Source and Result Buffered Adaptive Processors

The addition of result buffers to the source buffered
adaptive processor alters only R(I,D,f). R(I,D,f) is now most
sensitive to I+D. Again, this sensitivity can be reduced by
increasing N.
J_0 = I + D + 1

E[tv] = \sum_{t=1}^{T_m} t P[Mtv(t)]

AF = (1 - p_f E[Y_f]/n_f) / (1 - E[Y_f]/n_f)

p = (E[W_0] + E[C_r])^{-1}

PD[I,D] = 1 - PI[I,D] - PT[I,D]

[The right-hand sides of J_1, R(I,D,f), L, PT[I,D], E[B] and E[T]
are not legible in the source.]

Table 4.6  Equation Summary for Source and Result Buffered
Adaptive Processor
Virtual Function Unit Adaptive Processor

This  organization  provides  a  higher  scan  rate  and  the  maximum 
R(I,D,f).  The  higher  scan  rate  is  proportional  to  the  number  of 
virtual  function  units  available,  and  a  corresponding  increase  in 
scan  space  extent  will  occur. 
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4,10  on  of  the  ll2^2l§ 

In this section, the issue delay model described in section
4.2 is compared with the Window and scan space models developed
in sections 4.4 through 4.8.4. Afterward, the effectiveness of
lookahead features, operand buffering and forwarding is
discussed.

Both the issue delay model and scan space model determine the
IER of basic processor classes. Thus they are directly
applicable to processors executing primitive or low level
operations, such as microoperations. They can be considered as
studies in simplicity or refinement. The issue delay model needs
only a direct computation or a simple iteration. The scan space
model involves a complicated iteration in which scan space state
probabilities are computed in each iteration. The issue delay
model effectively combines or lumps model parameters which the
scan space model considers in detail. The issue delay model
combines the effects of the provision policy into an average
instruction fetch delay based on branch waits. The scan space
model considers the effects of provision policies for those
processors using microoperation prefetch. It also includes the
effects of conditioning schemes. The issue delay model combines
the effects of data hazards into a single probability
distribution. The scan space model separates these out.
This is necessary to evaluate the effects of data buffers. The
scan space model also considers the effects of microoperation
buffering. Finally, the issue delay model computes the mean time
between sequential issues. Consequently, the issue delay model
cannot directly model multiple scans. The scan space model
computes the mean number of issues per cycle. Thus it can
directly model such parallel organizations as horizontal
microprogrammable processors.

The  processors  for  which  scan  space  models  were  derived  all 
have  the  same  basic  formulas  and  use  the  iterative  algorithm 
described  in  section  4.9.  The  differences  between  conventional 
processors  and  adaptive  processors  arise  from  the  static  vs 
dynamic  mode  of  microinstruction  composition.  There  are 
advantages  and  disadvantages  to  both  modes. 

Conventional processors have the advantage of an infinite
scan rate. Thus the scan space is unrestricted in the
D-dimension and is restricted in the I-dimension by the maximum
degree format. Adaptive processor scan space is restricted by
the scan rate and by the average issue interval. However,
virtual function unit organizations can raise the effective scan
rate. In addition, the permissible microoperation combinations
pessimistic estimates. Buffering of data has been shown to
increase R(I,D,f) over the non-buffered cases. Buffering also
improves f-extensibility over the conventional processors. This
is demonstrated by the increased EXT(I,f) term for the buffered
cases.

The virtual function unit organization has no destination
register restriction and its R(I,D,f) is 1. The f-extensibility is
identical to that of the maximal processor.

In  the  branch  wait  region,  however,  the  adaptive 
organizations  would  appear  to  have  a  sizable  advantage  because  of 
the  stream  controller’s  ability  to  dynamically  react  to  branches. 
The  conventional  organizations  are  subject  to  more  frequent 
interruptions  because  their  block  length  is  shortened  by 
unconditional  branches.  The  degree  to  which  adaptive 
organizations  can  overcome  stream  turbulence  depends  on  the 
supply  rate  of  the  Control  Memory  organization  and  on  the  number 
of  Converger  levels. 

In the above discussions of the processor models, the effects
of various computer features on processor performance have been
discussed. The model permits a quantitative evaluation of the
capability of some hardware features to cope with the processing
of programs with certain characteristics. In particular, the
effect of branches on processors is captured by W(I+D).
Adaptive processors have been shown to be effective in this area,
but with high hardware cost. Program parallelism is jointly
characterized by NOD(I+D,f) and R(I,D,f). Changing the structure
of a program to increase NOD(I+D,f) increases the demand on
registers. This demand can be alleviated by software (careful
register assignments), by providing source and destination
buffers, or by increasing the number of work registers. These
options can be quantitatively evaluated using the scan space
model. In addition, the utility of multiple issues per cycle can
be compared to the utility of faster, serial scans for given
program structures.

The effect of restrictive program structures on look-ahead
processors can also be illustrated using the scan space models.
The effects of programs with a high branch incidence have already
been described. Sequential computations will concentrate the
probability in the operand dependency distribution o_i on small i.
This will greatly reduce NOD(I+D,f). In this case a hardware
speed-up is the only possible remedy for a higher IER.

The scan space model has unified many aspects of program and
hardware that relate to high performance computer organizations.
A complex and detailed modeling capability has been provided to
reflect the intricate relationships of program execution on
parallel processors. Further research is required to validate
the model and to possibly reduce its computational complexity.




CHAPTER  5 

CONCLUSION

This thesis represents a comprehensive study of the
computation process on low-level hardware. The problems and
complexity in effectively generating an optimized microprogram
have suggested that a hardware approach might provide an
attractive alternative to microprogram optimizations. As a
result, the adaptive processor was developed and several
variations were presented. Given that several different
organizations can accomplish a given task, a need naturally
arises for evaluation techniques that can assess the relative
utilities of these organizations. Consequently, the Window and
scan space models were developed to analytically interrelate the
features of processors and programs and quantify the performance
of a processor.

The adaptive processor is an interesting high performance
technique that dynamically coordinates low-level operations to
rapidly effect a computation. Because of the time scale in which
microoperations are executed, lengthy waits for microoperations
are intolerable. As a result, a large fraction of the adaptive
processor hardware is deployed to quickly respond to branches.
The unique hardware features used in this task are a Converger
that buffers several conditionally prefetched levels of
operations, Conditioning schemes that speed up the resolution of
branch predicates, and an Issuer that composes three conditional
microinstructions. All of these are necessary to maintain a high
performance level.

The adaptive processor organization can provide performance
superior to that of an operationally equivalent microprogrammable
processor given a sufficient microoperation supply rate and a
Converger with a sufficient number of levels. This is a result
of the increased block length and residual microoperations
provided by the stream controller, and the ability to buffer
source and result operands. Also, the adaptive processor
organization is less sensitive to changes in number and speed of
function units than the conventional microprogrammable
processors.

Several variations of adaptive processors that use source and
result buffering and operand forwarding were examined, along with
their control policies for regulating the execution of
microoperations. These examples demonstrate a spectrum of
possibilities that can be used to distribute microoperation
hazard avoidance responsibilities between control units and
operation units.

The Window and scan space models presented in chapter 4 were
developed to quantitatively evaluate the features included in
adaptive processors and other high performance,
single-instruction stream, single-data stream processors. They
serve to provide a unified treatment of the interaction of
program and hardware. The models present a complex computational
process requiring the iterative evaluation of scan space state
probabilities. However, the computational components used to
represent the state transition probabilities are conceptually
clean and simple, and they directly model significant features of
program and hardware.

The models were developed by examining the resource and
hazard avoidance requirements of a microoperation that is to be
executed and then developing probabilistic events to describe
these requirements. For the conventional microprogrammable and
adaptive processors modeled, performance differences are
attributable to several factors. The adaptive processors have
superior stream routing capabilities. This is a consequence of a
longer block length and the potential to compose
microinstructions across block boundaries. The conventional
microprogrammable processors, however, do not have a scan rate
limitation and can have a potentially larger scan space extent,
depending on microinstruction format restrictions.

The scan space models permit a quantitative evaluation of
hardware used in look-ahead computers [KEL75]. The effect of
operand buffering for adaptive processors is reflected in
PI[I,D], the state issue transition probability, by increases in
the R(I,D,f) term that provides a measure of register
availability, and by increases in the EXT(I,f) term that provides
a measure of f-function unit reserve capacity. Virtual function
units improve performance by increasing the effective scan rate
of the Issuer, and operand forwarding improves performance by
maximizing the R(I,D,f) component to 1. None of these hardware
enhancements was able to effect an improvement in NOD(I+D,f).
Improvements to NOD(I+D,f) can be effected only by increasing the
speed or number of f-function units.
A.1  Nested Conditioning

Nested Conditioning probably has the worst case hardware
complexity of the conditioning schemes described in section
3.3.7. A description outlining an implementation is included
here to demonstrate its feasibility. Figure A.1 shows the CC
flow paths from the generation of the CC at a function unit to
the predicate evaluation logic.

The activities associated with condition code processing are
geared to the fast resolution of branch predicates following the
initiation of a microinstruction with CC microoperations. These
activities are segmented into a pipeline with three stages as
discussed below. When a CC-microoperation is initiated, its
stack address is passed to the TAG field of the CC register
associated with the function unit assigned for execution. This
register also has a CCGFN field to temporarily store the
generated CC and a 1 bit GFN field to indicate if a CC is
available or will be available at the end of the cycle.

When a GFN is raised in a cycle, the stages of the CC
processing pipeline are successively activated. In stage 1, a
seek for the Active condition code is made by placing its stack
address on the CCBUS.ID lines. This address is obtained from
STK, the stack pointer of the Active arm, via the J register.
Each TAG comparator at the function unit CC register file checks
for a match and, if one is found, the corresponding CCGFN
contents are placed on the CCBUS.CODE lines. Stage 2 then
proceeds with evaluating the predicate using PREDEVAL. One input
of PREDEVAL is the CCBUS.CODE, which conveys the outcome (if any)
of stage 1, and the other input is the C2 field of the Active
BRANCH_INFO register. The predicate operation is specified in
the PRED field of BRANCH_INFO. At the end of stage 2, PREDEVAL
raises either the TRUE or FALSE output. However, the raised
value is valid only if both the Active CC and Active branch were
available. The PROMOTE_CC flag, when high, indicates this
validity. In stage 3, the predicate value, if valid, can be used
to initiate the corresponding conditional microinstruction and
PROMOTE the corresponding CONVERGER subtree.

Figure A.1  Nested Conditioning

Note that the pipeline is initiated before the CC is
available and that the first two stages are conditionally
initiated. The predicate validity decision is deferred to the
beginning of the third stage when the predicate value is to be
used. This strategy allows the administrative duties associated
with the tagging of CCs to be overlapped within the basic cycle,
without incurring any execution time penalty.
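
The three stages can be illustrated with a small behavioural sketch in
Python. It is a software analogy under stated assumptions, not the
hardware design: the class and function names are hypothetical, the
comparison operand of the branch is modelled as a single integer, and
the stages are run one after another, so the overlap within the basic
cycle is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CCRegister:
    tag: int    # TAG field: stack address of the CC-microoperation
    code: int   # CCGFN field: the generated condition code
    gfn: bool   # GFN bit: a CC is (or will be, this cycle) available


@dataclass
class BranchInfo:
    operand: int                      # value the CC is tested against (assumed)
    pred: Callable[[int, int], bool]  # predicate operation of the branch


def stage1_seek(cc_file: List[CCRegister], active_tag: int) -> Optional[int]:
    """Broadcast the Active stack address; a matching TAG returns its CC."""
    for reg in cc_file:
        if reg.gfn and reg.tag == active_tag:
            return reg.code
    return None


def stage2_predeval(code: Optional[int],
                    branch: Optional[BranchInfo]) -> Tuple[bool, bool]:
    """Evaluate the predicate; valid only if both CC and branch are available."""
    valid = code is not None and branch is not None  # plays the role of PROMOTE_CC
    value = branch.pred(code, branch.operand) if valid else False
    return value, valid


def stage3_select(value: bool, valid: bool) -> Optional[str]:
    """Choose the TRUE or FALSE arm to initiate and promote, or wait (None)."""
    if not valid:
        return None
    return "T" if value else "F"


def cc_pipeline(cc_file: List[CCRegister], active_tag: int,
                branch: Optional[BranchInfo]) -> Optional[str]:
    code = stage1_seek(cc_file, active_tag)
    value, valid = stage2_predeval(code, branch)
    return stage3_select(value, valid)
```

The validity decision is taken only in stage3_select, mirroring the
deferred decision described above: stages 1 and 2 always run, and their
result is discarded when either the CC or the branch was unavailable.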

The Condition Code Processor diagram also shows the
information flow paths used to store generated CCs in the
CCSTACK. The CCSTACK is a high speed register file addressed by
CCBUS.ID, stacking the CCs for branches which have not yet been
activated. Each register has a CCREADY field that indicates if a
CC is available. CCSTACK(STK) is a register containing the
top-of-stack contents. The BRANCH_INFO register shows the possible
inputs it requires. Upon a PROMOTE of a register subtree, either
BRANCH_INFO(F) or BRANCH_INFO(T) is input, depending on the
outcome of PREDEVAL. If the Active branch has not yet been
detected, it is passed directly from the memory buffers where
branch detection occurs.
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*  CSRG-12  THREE  DIMENSIONAL  DATA  DISPLAY  WITH  HIDDEN  LINE  REMOVAL 
Rupert  Bramall,  April  1972  [M.Sc.  Thesis,  DCS,  1971] 


*  CSRG-13  A  SYNTAX  DIRECTED  ERROF  RECOVERY  METHOD 

Lewis  R.  James,  May  1972  [M.Sc.  Thesis,  DCS,  1972  ] 


Abbreviations: 

DCS  -  Department  of  Computer  Science,  University  of  Toronto 
EE  -  Department  of  Electrical  Engineering,  University  of 
Toronto 

*  -  Out  of  print 


CSRG-14  THE  USE  OF  SERVICE  TIME  DISTRIBUTIONS  IN  SCHEDULING 
Kenneth  C.  Sevcik,  May  1972 

[Ph.D.  Thesis,  Committee  on  Information  Sciences, 
University  of  Chicago,  1971;  JA.CM,  January  1974] 

CSRG-15  PROCESS  STRUCTURING 

J.J.  Horning  and  B.  Randell,  June  1972 
[ACM  Computing  Surveys,  March  1973  ] 

CSRG-16  OPTIMAL  PROCESSOR  SCHEDULING  WHEN  SERVICE  TIMES  ARE 
HYPEPEXPONENTIALLY  DISTRIBUTED  AND  PREEMTION  OVERHEAD 
IS  NOT  NEGLIGIBLE 
Kenneth  C.  Sevcik,  June  1972 

[Proceedings  of  the  Symposium  on  Computer-Communication, 
Networks  and  Teletraffic,  Polytechnic  Institute  of 
Brooklyn,  1972] 

♦  CSRG-17  PROGRAMMING  LANGUAGE  TRANSLATION  TECHNIQUES 
W,M.  McKeeman,  July  1972 


CSRG-18  A  COMPARATIVE  ANALYSIS  OF  SEVERAL  DISK  SCHEDULING 
ALGORITHMS 

C.J.M.  Turnbull,  September  1972 

CSRG-19  PROJECT  SUE  AS  A  LEARNING  EXPERIENCE 
K.C.  Sevcik  e^  al,  September  1972 

[Proceedings  AFIPS  Fall  Joint  computer  Conference, 

V.  41,  December  1972  ] 

CSPG-20  A  STUDY  OF  LANGUAGE  DIRECTED  COMPUTER  DESIGN 
David  B.  Wortman,  December  1972 
[Ph.D.  Thesis,  Computer  Science  Department, 

Stanford  University,  197  2] 

CSRG-21  AN  APL  TERMINAL  APPROACH  TO  COMPUTER  MAPPING 

R.  Kvaternik,  December  1972  [M.Sc.  Thesis,  DCS,  1972] 


*  CSRG-22  AN  IMPLEMENTATION  LANGUAGE  FOE  MINICOMPU'l^EFS 

G.  G.  Kalmar,  January  1973  [M.Sc,  Thesis,  DCS,  1972  ] 


CSRG-23  COMPILER  STRUCTURE 

r 

W.M,  McKeeman,  January  1973 

[Proceedings  of  the  USA- Japan  Computer  Conference,  1972] 


*  CSRG-24  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

J.D,  Gannon  (ed.  )  ,  March  1973 


CSRG-25  THE  INVESTIGATION  OF  SERVICE  TIME  DISTRIBUTIONS 

Eleanor  A,  Lester,  April  1  973  [M.Sc.  Thesis,  DCS,  1973  ] 

*  CSRG-26  PSYCHOLOGICAL  COMPLEXITY  OF  COMPUTER  PROGRAMS: 

AN  INITIAL  EXPERIMENT 
Larry  Weissman,  A.ugust  1973 

*  CSRG-27  STRUCTURED  SUBSETS  OF  THE  PL/I  LANGUAGE 

Richard  C.  Holt  and  David  B.  Wortman,  October  1973 


♦  CSRG-28  ON  THE  REDUCED  MATFIX  F EPRE SENTATION  OF  LR(k) 

PARSER  TABLES 
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E,  Czarnik  and  D.  Tsichritzis  (eds.)#  November  1973 

♦  CSRG-30  A  PSEUDO-MACHINE  FOR  CODE  GENERATION 

Henry  John  Pasko,  December  1973  [ M , Sc.  Thesis,  DCS  1973] 

♦  CSRG-31  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 

ENGINEERING 
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CSRG-32  SCHEDULING  MULTIPLE  RESOURCE  COMPUTER  SYSTEMS 

E. D,  Lazowska,  May  1974  [M.Sc.  Thesis,  DCS,  1974] 

♦  CSRG-33  AN  EDUCATIONAL  DATA  BASE  MANAGEMENT  SYSTEM 

F.  Lochovsky  and  D.  Tsichritzis,  May  1974  [INFOR, 
to  appear] 

♦  CSRG-34  ALLOCATING  STORAGE  IN  HIERARCHICAL  DATA  BASES' 

P.  Bernstein  and  D,  Tsichritzis,  May  1974  [Information 
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♦  CSRG-35  ON  IMPLEMENTATION  OF  RELATIONS 

D.  Tsichritzis,  May  1974 

♦  CSFG-36  SIX  PL/I  COMPILERS 

D.  E.  Wortman,  P.J.  Khaiat,  and  D.M.  Lasker,  August  1974 
[Software  Practice  and  Experience,  v.6,  n.3, 

July-Sept.  1976] 

♦  CSPG-37  A  METHODOLOGY  FOE  STUDYING  THE  PSYCHOLOGICAL  COMPLEXITY 
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Laurence  M.  Weissman,  August  1974 

[Ph.D.  Thesis,  DCS,  1974  ] 

♦  CSRG-38  AN  INVESTIGATION  OF  A  NEW  METHOD  OF  CONSTRUCTING 
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David  M.  Lasker,  September  1974  [M.Sc.  Thesis,  DCS,  1974] 

CSRG-39  AN  ALGEBRAIC  MODEL  FOP  STRING  PATTEPNS 

Glenn  F.  Stewart,  September  1974  [M.Sc.  Thesis,  DCS,  1974] 

♦  CSRG-40  EDUCATIONAL  DATA  BASE  SYSTEM  USER’S  MANUAL 

J.  Klebanoff,  F.  Lochovsky,  A.  Rozitis,  and 
D.  Tsichritzis,  September  1974 

♦  CSRG-41  NOTES  FROM  A  WORKSHOP  ON  THE  ATTAINMENT  OF 

RELIABLE  SOFTWARE 

David  B.  Wortman  (ed.),  September  1974

♦  CSRG-42  THE  PROJECT  SUE  SYSTEM  LANGUAGE  REFERENCE  MANUAL

B.L.  Clark  and  F.J.E.  Ham,  September  1974

A  DATA  BASE  PROCESSOR

E.A.  Ozkarahan,  S.A.  Schuster  and  K.C.  Smith,
November  1974  [Proceedings  National  Computer
Conference  1975,  v.44,  pp.  379-388]

♦  CSRG-44  MATCHING  PROGRAM  AND  DATA  REPRESENTATION  TO  A 

COMPUTING  ENVIRONMENT 

Eric  C.R.  Hehner,  November  1974  [Ph.D.  Thesis,  DCS,  1974] 

♦  CSRG-45  THREE  APPROACHES  TO  RELIABLE  SOFTWARE:  LANGUAGE

DESIGN,  DYADIC  SPECIFICATION,  COMPLEMENTARY  SEMANTICS

J.E.  Donahue,  J.D.  Gannon,  J.V.  Guttag  and
J.J.  Horning,  December  1974

CSRG-46  THE  SYNTHESIS  OF  OPTIMAL  DECISION  TREES  FROM
DECISION  TABLES 

Helmut  Schumacher,  December  1974 
[M.Sc.  Thesis,  DCS,  1974] 

CSRG-47  LANGUAGE  DESIGN  TO  ENHANCE  PROGRAMMING  RELIABILITY 

John  D.  Gannon,  January  1975  [Ph.D.  Thesis,  DCS,  1975]

CSRG-48  DETERMINISTIC  LEFT  TO  RIGHT  PARSING

Christopher  J.M.  Turnbull,  January  1975 
[Ph.D.  Thesis,  EE,  1974] 

♦  CSRG-49  A  NETWORK  FRAMEWORK  FOR  RELATIONAL  IMPLEMENTATION

D.  Tsichritzis,  February  1975  [in  Data  Base
Description,  Douque  and  Nijssen  (eds.),  North
Holland  Publishing  Co.]

♦  CSRG-50  A  UNIFIED  APPROACH  TO  FUNCTIONAL  DEPENDENCIES 

AND  RELATIONS 

P.A.  Bernstein,  J.R.  Swenson  and  D.C.  Tsichritzis,
February  1975  [Proceedings  of  the  ACM  SIGMOD  Conference,
1975]

♦  CSRG-51  ZETA:  A  PROTOTYPE  RELATIONAL  DATA  BASE 

MANAGEMENT  SYSTEM 

M.  Brodie  (ed.),  February  1975  [Proceedings  Pacific
ACM  Conference,  1975] 

CSRG-52  AUTOMATIC  GENERATION  OF  SYNTAX-REPAIRING  AND
PARAGRAPHING  PARSERS 

David  T.  Barnard,  March  1975  [M.Sc.  Thesis,  DCS,  1975]

♦  CSRG-53  QUERY  EXECUTION  AND  INDEX  SELECTION  FOR  RELATIONAL 

DATA  BASES 

J.H.  Gilles  Farley  and  Stewart  A.  Schuster,  March  1975 

CSRG-54  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER
PROGRAM  ENGINEERING 

J.V.  Guttag  (ed.),  Third  Edition,  April  1975

CSRG-55  STRUCTURED  SUBSETS  OF  THE  PL/I  LANGUAGE

Richard  C.  Holt  and  David  B.  Wortman,  May  1975


CSRG-56  FEATURES  OF  A  CONCEPTUAL  SCHEMA

D.  Tsichritzis,  June  1975  [Proceedings  Very  Large
Data  Base  Conference,  1975] 

*  CSRG-57  MERLIN:  TOWARDS  AN  IDEAL  PROGRAMMING  LANGUAGE

Eric  C.R.  Hehner,  July  1975 

CSRG-58  ON  THE  SEMANTICS  OF  THE  RELATIONAL  DATA  MODEL 
Hans  Albrecht  Schmid  and  J.  Richard  Swenson, 

July  1975  [Proceedings  of  the  ACM  SIGMOD  Conference,
1975]

*  CSRG-59  THE  SPECIFICATION  AND  APPLICATION  TO  PROGRAMMING 

OF  ABSTRACT  DATA  TYPES 

John  V.  Guttag,  September  1975  [Ph.D.  Thesis,  DCS,  1975] 

CSRG-60  NORMALIZATION  AND  FUNCTIONAL  DEPENDENCIES  IN  THE 
RELATIONAL  DATA  BASE  MODEL 
Phillip  Alan  Bernstein,  October  1975 
[Ph.D.  Thesis,  DCS,  1975] 

*  CSRG-61  LSL:  A  LINK  AND  SELECTION  LANGUAGE 

D.  Tsichritzis,  November  1975  [Proceedings  ACM
SIGMOD  Conference,  1976] 

*  CSRG-62  COMPLEMENTARY  DEFINITIONS  OF  PROGRAMMING 

LANGUAGE  SEMANTICS 

James  E.  Donahue,  November  1975 

[Ph.D.  Thesis,  DCS,  1975] 

CSRG-63  AN  EXPERIMENTAL  EVALUATION  OF  CHESS  PLAYING 
HEURISTICS 

Lazio  Sugar,  December  1975  [M.Sc.  Thesis,  DCS,  1975] 

CSRG-64  A  VIRTUAL  MEMORY  SYSTEM  FOR  A  RELATIONAL 
ASSOCIATIVE  PROCESSOR 

S.A.  Schuster,  E.A.  Ozkarahan,  and  K.C.  Smith, 

February  1976  [Proceedings  National  Computer 
Conference  1976,  v.45,  pp. 855-862] 

*  CSRG-65  PERFORMANCE  EVALUATION  OF  A  RELATIONAL 

ASSOCIATIVE  PROCESSOR 

E.A.  Ozkarahan,  S.A.  Schuster,  and  K.C.  Sevcik,
February  1976  [ACM  Transactions  on  Database
Systems,  v.1,  n.4,  December  1976]

CSRG-66  EDITING  COMPUTER  ANIMATED  FILM 
Michael  D.  Tilson,  February  1976 
[M.Sc.  Thesis,  DCS,  1975] 

CSRG-67  A  DIAGRAMMATIC  APPROACH  TO  PROGRAMMING  LANGUAGE
SEMANTICS 

James  R.  Cordy,  March  1976  [M.Sc.  Thesis,  DCS,  1976] 

*  CSRG-68  A  SYNTHETIC  ENGLISH  QUERY  LANGUAGE  FOR  A 

RELATIONAL  ASSOCIATIVE  PROCESSOR 

L.  Kerschberg,  E.A.  Ozkarahan,  and  J.E.S.  Pacheco,
April  1976 


CSRG-69  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM  ENGINEERING 

D.  Barnard  and  D.  Thompson  (Eds.),  Fourth  Edition,  May  1976 

♦  CSRG-70  A  TAXONOMY  OF  DATA  MODELS 

L.  Kerschberg,  A.  Klug,  and  D.  Tsichritzis,  May  1976 
[Proceedings  Very  Large  Data  Base  Conference,  1976] 

CSRG-71  OPTIMIZATION  FEATURES  FOR  THE  ARCHITECTURE  OF  A 
DATA  BASE  MACHINE 

E.A.  Ozkarahan  and  K.C.  Sevcik,  May  1976

*  CSRG-72  THE  RELATIONAL  DATA  BASE  SYSTEM  OMEGA  -  PROGRESS  REPORT 

H.A.  Schmid  (ed.),  P.A.  Bernstein  (ed.),  B.  Arlow,
R.  Baker  and  S.  Pozgaj,  July  1976

CSRG-73  AN  ALGORITHMIC  APPROACH  TO  NORMALIZATION  OF
RELATIONAL  DATA  BASE  SCHEMAS 
P.A.  Bernstein  and  C.  Beeri,  September  1976 

CSRG-74  A  HIGH-LEVEL  MACHINE-ORIENTED  ASSEMBLER  LANGUAGE
FOR  A  DATA  BASE  MACHINE 

E.A.  Ozkarahan  and  S.A.  Schuster,  October  1976 

CSRG-75  DO  CONSIDERED  OD:  A  CONTRIBUTION  TO  THE 
PROGRAMMING  CALCULUS 
Eric  C.R.  Hehner,  November  1976 

CSRG-76  "SOFTWARE  HUT":  A  COMPUTER  PROGRAM  ENGINEERING 
PROJECT  IN  THE  FORM  OF  A  GAME 
J.J.  Horning  and  D.B.  Wortman,  November  1976 

CSRG-77  A  SHORT  STUDY  OF  PROGRAM  AND  MEMORY  POLICY  BEHAVIOUR 
G.  Scott  Graham,  January  1977 

CSRG-78  A  PANACHE  OF  DBMS  IDEAS 

D.  Tsichritzis,  February  1977 

CSRG-79  THE  DESIGN  AND  IMPLEMENTATION  OF  AN  ADVANCED  LALR
PARSE  TABLE  CONSTRUCTOR

David  H.  Thompson,  April  1977  [M.Sc.  Thesis,  DCS,  1976]

CSRG-80  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM  ENGINEERING 
D.  Barnard  (Ed.),  Fifth  Edition,  May  1977 

CSRG-81  PROGRAMMING  METHODOLOGY:  AN  ANNOTATED  BIBLIOGRAPHY  FOR 
IFIP  WORKING  GROUP  2.3 

Sol  J.  Greenspan  and  J.J.  Horning  (Eds.),  First  Edition,

May  1977 

CSRG-82  NOTES  ON  EUCLID 

edited  by  W.  David  Elliot  and  David  T.  Barnard, 

August  1977 

CSRG-83  TOPICS  IN  QUEUEING  NETWORK  MODELING
edited  by  G.  Scott  Graham,  July  1977 


CSRG-84  TOWARD  PROGRAM  ILLUSTRATION

Edward  Yarwood,  September  1977  [M.Sc.  Thesis,  DCS,  1974] 


CSRG-85  CHARACTERIZING  SERVICE  TIME  AND  RESPONSE  TIME  DISTRIBUTIONS
IN  QUEUEING  NETWORK  MODELS  OF  COMPUTER  SYSTEMS 
Edward  D.  Lazowska,  September  1977 
[Ph.D.  Thesis,  DCS,  1977] 

CSRG-86  MEASUREMENTS  OF  COMPUTER  SYSTEMS  FOR
QUEUEING  NETWORK  MODELS
Martin  G.  Kienzle,  October  1977
[M.Sc.  Thesis,  DCS,  1977]

CSRG-87  *OLGA*  LANGUAGE  REFERENCE  MANUAL 

B.  Abourbih,  H.  Trickey,  D.M.  Lewis,  E.S.  Lee, 

P.I.P.  Boulton,  November  1977

CSRG-88  USING  A  GRAMMATICAL  FORMALISM  AS  A  PROGRAMMING  LANGUAGE
Brad  A.  Silverberg,  January  1978
[M.Sc.  Thesis,  DCS,  1978] 

CSRG-89  ON  THE  IMPLEMENTATION  OF  RELATIONS:  A  KEY  TO  EFFICIENCY 
Joachim  W.  Schmidt,  January  1978 

CSRG-90  DATA  BASE  MANAGEMENT  SYSTEM  USER  PERFORMANCE
Frederick  H.  Lochovsky,  April  1978 
[Ph.D.  Thesis,  DCS,  1978] 

CSRG-91  SPECIFICATION  AND  VERIFICATION  OF  DATA  BASE
SEMANTIC  INTEGRITY 

Michael  Lawrence  Brodie,  April  1978 
[Ph.D.  Thesis,  DCS,  1978] 

CSRG-92  "STRUCTURED  SOUND  SYNTHESIS  PROJECT  (SSSP):

AN  INTRODUCTION" 

by  William  Buxton,  Guy  Fedorkow,  with
Ronald  Baecker,  Gustav  Ciamaga,  Leslie  Mezei 
and  K.C.  Smith,  June  1978 

CSRG-93  "A  DEVICE-INDEPENDENT,  GENERAL-PURPOSE  GRAPHICS
SYSTEM  IN  A  MINICOMPUTER  TIME-SHARING 
ENVIRONMENT" 

William  T.  Reeves,  August  1978
[M.Sc.  Thesis,  DCS,  1976] 

CSRG-94  "ON  THE  AXIOMATIC  VERIFICATION  OF  CONCURRENT  ALGORITHMS"
Christian  Lengauer,  August  1978
[M.Sc.  Thesis,  DCS,  1978] 

CSRG-95  PISA:  "A  PROGRAMMING  SYSTEM  FOR  INTERACTIVE 
PRODUCTION  OF  APPLICATION  SOFTWARE" 

Rudolf  Marty,  August  1978

CSRG-96  "ADAPTIVE  MICROPROGRAMMING  AND  PROCESSOR  MODELING"

Walter  G.  Rosocha
[Ph.D.  Thesis,  EE,  August  1978]